Unit 1
Data
• Data encompasses a collection of discrete objects, numbers, words,
events, facts, measurements, observations, or even descriptions of
things. Such data is collected and stored from events and processes
in many disciplines, including biology, economics,
engineering, marketing, and others.
Information
• Processing data yields useful information, and processing such
information generates useful knowledge.
EDA
• Exploratory Data Analysis is a process of examining the available dataset to
discover patterns, spot anomalies, test hypotheses, and check assumptions using statistical
measures.
• For example, an application tracking the sleeping pattern of patients suffering from dementia
requires storage for several types of sensor data, such as sleep data, the patient's heart rate,
electro-dermal activity, and user activity patterns. All of these data points are required to
correctly diagnose the mental state of the person, so they are mandatory requirements for the
application. In addition, the data must be categorized as numerical or categorical, and
the format of storage and dissemination must be decided.
• Data collection: Data collected from several sources must be stored in the correct format and
transferred to the right information technology personnel within a company. As mentioned
previously, data can be collected from several objects on several events using different types of
sensors and storage tools.
• Data processing: Preprocessing involves curating the dataset before the
actual analysis. Common tasks involve correctly exporting the dataset, placing it
under the right tables, structuring it, and exporting it in the correct format.
• Data cleaning: Preprocessed data is still not ready for detailed analysis. It must be
checked for incompleteness, duplicates, errors, and
missing values. These tasks are performed in the data cleaning stage, which
involves responsibilities such as matching the correct record, finding inaccuracies in
the dataset, understanding the overall data quality, removing duplicate items, and
filling in the missing values.
• EDA: Exploratory data analysis is the stage where we actually start to understand the message contained in the
data. Several types of data transformation techniques might be required during the process
of exploration.
• Modeling and algorithm: From a data science perspective, generalized models or mathematical formulas can
represent or exhibit relationships among different variables, such as correlation or causation. These models or
equations involve one or more variables that depend on other variables to cause an event.
• For example, when buying pens, the total price of pens (Total) = price for one pen (UnitPrice) * the number of
pens bought (Quantity). Hence, our model would be Total = UnitPrice * Quantity. Here, the total price
depends on the unit price and the quantity. Hence, the total price is referred to as the dependent variable, and the
unit price and quantity are referred to as independent variables. In general, a model always describes the relationship
between independent and dependent variables. Inferential statistics deals with quantifying relationships between particular variables.
• The Judd model for describing the relationship between data, model, and error still holds true: Data = Model +
Error.
• Data Product: Any computer software that uses data as inputs, produces outputs, and
provides feedback based on the output to control the environment is referred to as a data
product. A data product is generally based on a model developed during data analysis, for
example, a recommendation model that inputs user purchase history and recommends a
related item that the user is highly likely to buy.
• Communication: This stage deals with disseminating the results to end stakeholders to use
the result for business intelligence. One of the most notable steps in this stage is data
visualization. Visualization deals with information relay techniques such as tables, charts,
summary diagrams, and bar charts to show the analyzed result.
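The modeling step above can be illustrated with a minimal sketch of the pen-pricing model and the relation Data = Model + Error. The purchase records and the function name predict_total are hypothetical, chosen only for illustration.

```python
# Hypothetical purchase records: (unit_price, quantity, observed_total).
# The second record carries an illustrative 0.50 surcharge that the model does not capture.
purchases = [
    (2.0, 5, 10.0),
    (1.5, 4, 6.5),
    (3.0, 2, 6.0),
]


def predict_total(unit_price, quantity):
    """Model: Total = UnitPrice * Quantity."""
    return unit_price * quantity


# Error = Data - Model, computed for each record.
errors = [observed - predict_total(price, qty) for price, qty, observed in purchases]
print(errors)
```

A non-zero error marks a record that the model does not fully explain, which is often a useful starting point for exploration.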
1.3 The significance of EDA
Numerical data
Numerical data has a sense of measurement involved in it; for example, a person's age, height, weight, blood pressure,
heart rate, temperature, number of teeth, number of bones, and the number of family members. This data is
often referred to as quantitative data in statistics. A numerical dataset can be either discrete or
continuous.
• Discrete data
• This is data that is countable and its values can be listed out. For example, if we flip a coin, the number of
heads in 200 coin flips can take any integer value from 0 to 200, which is a finite set of cases. A variable that
represents a discrete dataset is referred to as a discrete variable. The discrete variable takes a fixed number of
distinct values. For example, the Country variable can have values such as Nepal, India, Norway, and Japan. The set
of values is fixed. The Rank variable of a student in a classroom can take values from 1, 2, 3, 4, 5, and so on.
• Continuous data
• A variable that can have an infinite number of numerical values within
a specific range is classified as continuous data. A variable describing
continuous data is a continuous variable.
• For example, what is the temperature of your city today? The possible
answers cannot be listed as a finite set. Similarly, the weight variable in the previous section is a
continuous variable.
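The difference between the two kinds of numerical data can be seen in a minimal sketch. The temperature range of 15 to 35 degrees Celsius and the variable names are assumptions made for illustration only.

```python
import random

random.seed(0)  # fixed seed so the sketch is repeatable

# Discrete: the number of heads in 200 flips is one of the integers 0..200.
heads = sum(random.choice([0, 1]) for _ in range(200))
print(heads, isinstance(heads, int))

# Continuous: a temperature reading can fall anywhere inside a range.
temperature_c = random.uniform(15.0, 35.0)
print(temperature_c)
```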
Categorical data
• This type of data represents the characteristics of an object; for example, gender, marital status, type
of address, or movie categories. This data is often referred to as qualitative data in
statistics. Some of the most common types of categorical data are:
• Gender (Male, Female, Other, or Unknown)
• Marital Status (Annulled, Divorced, Interlocutory, Legally Separated, Married, ...)
• Movie genres (Action, Adventure, Comedy, Crime, Drama, Fantasy, Historical,
Horror, Mystery, Philosophical, Political, Romance, Saga, Satire, Science Fiction,
Social, Thriller, Urban, or Western)
• Blood type (A, B, AB, or O)
• Types of drugs (Stimulants, Depressants, Hallucinogens, Dissociatives,
Opioids, Inhalants, or Cannabis)
• A variable describing categorical data is referred to as a categorical variable.
• A binary categorical variable can take exactly two values and is also referred to as a dichotomous variable.
For example, when you run an experiment, the result is either success or failure. Hence, the result can be
understood as a binary categorical variable.
• Polytomous variables are categorical variables that can take more than two
possible values.
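The distinction between binary and polytomous variables can be sketched as a small helper. The function name classify_categorical is hypothetical, and the sketch assumes that every possible category appears at least once in the observed values.

```python
def classify_categorical(values):
    """Label a categorical variable by its number of distinct observed values."""
    distinct = set(values)
    if len(distinct) == 2:
        return "binary (dichotomous)"
    if len(distinct) > 2:
        return "polytomous"
    return "constant"  # fewer than two categories observed


experiment_results = ["success", "failure", "success", "success"]
blood_types = ["A", "B", "AB", "O", "A"]

print(classify_categorical(experiment_results))
print(classify_categorical(blood_types))
```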
Measurement scales
• Nominal
• Ordinal
• Interval
• Ratio
Nominal
• Nominal scales are used for labeling variables without any quantitative value. The scales are generally referred to as
labels. They are mutually exclusive and do not carry any numerical importance.
• Nominal scales are considered qualitative scales, and the measurements taken using qualitative scales are
considered qualitative data.
• Examples
• Biological species
Ratio
• Ratio scales contain order, exact values, and absolute zero, which
makes them usable in both descriptive and inferential statistics.
These scales provide numerous possibilities for statistical analysis.
• Mathematical operations, measures of central tendency, measures
of dispersion, and the coefficient of variation can all be
computed from such scales.
• Examples include measures of energy, mass, length, duration,
electrical energy, plane angle, and volume.
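The statistics that a ratio scale supports can be computed with Python's standard statistics module. This is a minimal sketch; the mass values are hypothetical and chosen only so that the arithmetic is easy to follow.

```python
import statistics

# Hypothetical masses in kilograms (a ratio scale: zero means "no mass").
masses_kg = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

mean = statistics.mean(masses_kg)        # central tendency
median = statistics.median(masses_kg)    # central tendency
std_dev = statistics.pstdev(masses_kg)   # dispersion (population standard deviation)
coefficient_of_variation = std_dev / mean

print(mean, median, std_dev, coefficient_of_variation)
```

The coefficient of variation is meaningful here only because the scale has an absolute zero; on an interval scale the ratio of standard deviation to mean has no fixed interpretation.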
A summary of the data types and scale measures:
1.5 Comparing EDA with classical and Bayesian analysis
• Classical data analysis: For the classical data
analysis approach, the problem definition and
data collection step are followed by model
development, which is followed by analysis and
result communication.
• Exploratory data analysis approach: The EDA
approach follows the same steps as classical
data analysis, except that the model imposition and the data
analysis steps are swapped. The main focus is on the
data, its structure, outliers, models, and visualizations.
Generally, in EDA, we do not impose any
deterministic or probabilistic models on the data.
• Bayesian data analysis approach: The Bayesian
approach incorporates prior probability distribution
knowledge into the analysis steps as shown in the
diagram.
• Prior probability, in Bayesian statistics, is the probability of an
event before new data is collected. This is the best rational
assessment of the probability of an outcome based on the
current knowledge before an experiment is performed.
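How a prior probability is combined with new data can be sketched with Bayes' rule for a binary hypothesis. The function name posterior and all of the numbers are assumptions made for illustration; they do not come from a real study.

```python
def posterior(prior, likelihood, likelihood_given_not):
    """Bayes' rule for a binary hypothesis H given evidence E.

    prior: P(H) before the data is seen
    likelihood: P(E | H)
    likelihood_given_not: P(E | not H)
    """
    evidence = likelihood * prior + likelihood_given_not * (1.0 - prior)
    return likelihood * prior / evidence


# Illustrative numbers only: 1% prior, 95% hit rate, 5% false-alarm rate.
p = posterior(prior=0.01, likelihood=0.95, likelihood_given_not=0.05)
print(round(p, 4))
```

With a low prior, even fairly strong evidence leaves the updated probability well below one half, which is why the choice of prior matters in this approach.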
1.6 Software tools available for EDA
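import numpy as np
# Illustrative 2D array to inspect (the values are an assumption)
my2DArray = np.array([[1, 2, 3, 4], [5, 6, 7, 8]], dtype=np.int64)
print(my2DArray.data)     # memory buffer holding the elements
print(my2DArray.shape)    # size along each dimension
print(my2DArray.dtype)    # data type of the elements
print(my2DArray.strides)  # bytes to step in each dimension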
4. For creating an array using built-in NumPy functions, we will use the following code:
# Array of ones
ones = np.ones((3, 4))
print(ones)
# Array of zeros
zeros = np.zeros((2, 3, 4), dtype=np.int16)
print(zeros)
# Array with random values
randomArray = np.random.random((2, 2))
print(randomArray)
# Empty array (contents are uninitialized, so the printed values are arbitrary)
emptyArray = np.empty((3, 2))
print(emptyArray)
# Full array (every element set to 7)
fullArray = np.full((2, 2), 7)
print(fullArray)
5. For NumPy arrays and file operations, we will use the
following code:
# Save a NumPy array to a text file, one value per line
x = np.arange(0.0, 50.0, 1.0)
np.savetxt('data.out', x, delimiter=',')
# Load the array back from the text file
z = np.loadtxt('data.out', unpack=True)
print(z)
# Load with genfromtxt: skip the first line and replace missing entries with -999
my_array2 = np.genfromtxt('data.out',
                          skip_header=1,
                          filling_values=-999)
print(my_array2)
6. For inspecting NumPy arrays, we will use the following
code:
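A minimal sketch of array inspection follows. It assumes a small illustrative array named my2DArray whose values are not taken from the original; the attributes shown are standard ndarray attributes.

```python
import numpy as np

# Illustrative array; the values are assumptions, not taken from the original.
my2DArray = np.array([[1, 2, 3, 4], [5, 6, 7, 8]], dtype=np.int64)

print(my2DArray.ndim)      # number of dimensions
print(my2DArray.size)      # total number of elements
print(my2DArray.itemsize)  # bytes per element
print(my2DArray.nbytes)    # total bytes consumed by the elements
print(my2DArray.flags)     # memory-layout information
```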
Matplotlib
Matplotlib provides a huge library of customizable plots, along with a comprehensive set
of backends. It can be utilized to create professional reporting applications, interactive
analytical applications, complex dashboard applications, web/GUI applications,
embedded views, and many more.