
2 MARK Material

Fundamentals of Data Science and Analytics (Anna University)



SRM TRP Engineering College
Department of Artificial Intelligence and Data Science
AD3491 - FUNDAMENTALS OF DATA SCIENCE AND ANALYTICS

UNIT I - INTRODUCTION TO DATA SCIENCE

Q.1 What is data science ?


Ans. :
Data science is an interdisciplinary field that seeks to extract knowledge or insights from various
forms of data.
• At its core, data science aims to discover and extract actionable knowledge from data
that can be used to make sound business decisions and predictions.
• Data science uses advanced analytical theory and methods such as time
series analysis for predicting future outcomes.

Q.2 Define structured data.


Ans. : Structured data is arranged in a rows-and-columns format. This helps applications
retrieve and process the data easily. A database management system is used for storing structured
data. The term structured data refers to data that is identifiable because it is organized in a
structure.

Q.3 What is a data set ?


Ans. : A data set is a collection of related records or information. The information may be
about some entity or some subject area.

Q.4 What is unstructured data ?


Ans.: Unstructured data is data that does not follow a specified format. Rows and columns are
not used for unstructured data, so it is difficult to retrieve the required information.
Unstructured data has no identifiable structure.

Q.5 What is machine-generated data ?


Ans. : Machine-generated data is information created without human interaction as
a result of a computer process or application activity. This means that data entered manually
by an end-user is not considered machine-generated.

Q.6 Define streaming data.


Ans. : Streaming data is data that is generated continuously by thousands of data sources,
which typically send in the data records simultaneously and in small sizes (order of
Kilobytes).

Q.7 List the stages of data science process.


Ans. : Stages of data science process are as follows :
• Discovery or setting the research goal
• Retrieving data
• Data preparation
• Data exploration
• Data modeling
• Presentation and automation.

Q.8 What are the advantages of data repositories ?


Ans.: Advantages are as follows :
• Data is preserved and archived.
• Data isolation allows for easier and faster data reporting.


• Database administrators have an easier time tracking problems.


• There is value to storing and analyzing data.

Q.9 What is data cleaning ?


Ans.: Data cleaning means removing the inconsistent data or noise and collecting necessary
information of a collection of interrelated data.

Q.10 What is outlier detection ?


Ans.: Outlier detection is the process of detecting and subsequently excluding outliers from a
given set of data. The easiest way to find outliers is to use a plot or a table with the minimum
and maximum values.

Q.11 Explain exploratory data analysis.


Ans.: Exploratory Data Analysis (EDA) is a general approach to exploring datasets by means of
simple summary statistics and graphic visualizations in order to gain a deeper understanding
of data. EDA is used by data scientists to analyze and investigate data sets and summarize
their main characteristics, often employing data visualization methods.

Q.12 What is data cleaning ?


Ans.: Data cleaning means removing the inconsistent data or noise and collecting necessary
information of a collection of interrelated data.

Q.13 List the stages of data science process.


Ans. : Data science process consists of six stages :
1. Discovery or setting the research goal
2. Retrieving data
3. Data preparation
4. Data exploration
5. Data modeling
6. Presentation and automation.

Q.14 What is data repository ?


Ans. : Data repository is also known as a data library or data archive. This is a general term to
refer to a data set isolated to be mined for data reporting and analysis. The data repository is a
large database infrastructure, several databases that collect, manage and store data sets for
data analysis, sharing and reporting.

Q.15 List the data cleaning tasks ?


Ans. : Data cleaning tasks are as follows :
1. Data acquisition and metadata
2. Fill in missing values
3. Unified date format
4. Converting nominal to numeric
5. Identify outliers and smooth out noisy data
6. Correct inconsistent data.

Q.16 What is Euclidean distance ?


Ans. : Euclidean distance is used to measure the similarity between observations. It is calculated
as the square root of the sum of squared differences between corresponding points.
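As an illustrative sketch with hypothetical vectors, the distance can be computed with NumPy :

```python
import numpy as np

# Two hypothetical observations (any equal-length numeric vectors).
a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 6.0, 3.0])

# Square root of the sum of squared differences.
dist = np.sqrt(np.sum((a - b) ** 2))
print(dist)  # 5.0
```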


UNIT II - DESCRIPTIVE ANALYTICS

Q.1 Define qualitative data.


Ans. : Qualitative data provides information about the quality of an object or information
which cannot be measured. Qualitative data cannot be expressed as a number. Data that
represent nominal scales such as gender, economic status and religious preference are usually
considered to be qualitative data. It is also called categorical data.

Q.2 What is quantitative data ?


Ans.: Quantitative data is the one that focuses on numbers and mathematical calculations and
can be calculated and computed. Quantitative data are anything that can be expressed as a
number or quantified. Examples of quantitative data are scores on achievement tests, number
of hours of study or weight of a subject.

Q.3 What is nominal data ?


Ans. : Nominal data is the 1st level of measurement scale, in which the numbers serve as
"tags" or "labels" to classify or identify objects. Nominal data is a type of qualitative data.
It usually deals with non-numeric variables or numbers that do not carry
any value. While developing statistical models, nominal data are usually transformed before
building the model.

Q.4 Describe ordinal data.


Ans.: Ordinal data is a variable in which the value of the data is captured from an ordered set,
which is recorded in the order of magnitude. Ordinal represents the "order." Ordinal data is
known as qualitative data or categorical data. It can be grouped, named and also ranked.

Q.5 What is an interval data ?


Ans. : Interval data corresponds to a variable in which the value is chosen from an interval
set. It is defined as a quantitative measurement scale in which the difference between the two
variables is meaningful. In other words, the variables are measured in an exact manner, not as
in a relative way in which the presence of zero is arbitrary.

Q.6 What is frequency distribution ?


Ans. : Frequency distribution is a representation, either in a graphical or tabular format that
displays the number of observations within a given interval. The interval size depends on the
data being analyzed and the goals of the analyst.

Q.7 What is cumulative frequency ?


Ans. : A cumulative frequency distribution can be useful for ordered data (e.g. data arranged
in intervals, measurement data, etc.). Instead of reporting frequencies, the recorded values are
the sum of all frequencies for values less than and including the current value.
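For example (with hypothetical interval frequencies), a cumulative frequency column is just a running sum :

```python
import numpy as np

# Hypothetical frequencies for five ordered intervals.
freqs = np.array([2, 5, 8, 3, 2])

# Each entry is the sum of all frequencies up to and including that interval.
cum_freqs = np.cumsum(freqs)
print(cum_freqs)  # [ 2  7 15 18 20]
```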

Q.8 Explain histogram.


Ans. : A histogram is a special kind of bar graph that applies to quantitative data (discrete or
continuous). The horizontal axis represents the range of data values. The bar height
represents the frequency of data values falling within the interval formed by the width of the
bar. The bars are also pushed together with no spaces between them.


Q.9 What is the goal of variability ?


Ans. : The goal for variability is to obtain a measure of how spread out the scores are in a
distribution. A measure of variability usually accompanies a measure of central tendency as
basic descriptive statistics for a set of scores.

Q.10 How to calculate range ?


Ans.: The range is the total distance covered by the distribution, from the highest score to the
lowest score (using the upper and lower real limits of the range).
Range = Maximum value - Minimum value
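A minimal sketch with hypothetical scores :

```python
scores = [12, 7, 25, 18, 3]  # hypothetical scores

# Range = maximum value - minimum value.
value_range = max(scores) - min(scores)
print(value_range)  # 22
```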

Q.11 What is an independent variable ?


Ans. : An independent variable is the variable that is changed or controlled in a scientific
experiment to test the effects on the dependent variable.

Q.12 Explain frequency polygon.


Ans. : Frequency polygons are a graphical device for understanding the shapes of
distributions. They serve the same purpose as histograms, but are especially helpful for
comparing sets of data. Frequency polygons are also a good choice for displaying cumulative
frequency distributions.

Q.13 What is stem and leaf diagram ?


Ans. : Stem and leaf diagrams allow to display raw data visually. Each raw score is divided
into a stem and a leaf. The leaf is typically the last digit of the raw value. The stem is the
remaining digits of the raw value. Data points are split into a leaf (usually the ones digit) and
a stem (the other digits).

Q.14 What is correlation ?


Ans.: Correlation refers to a relationship between two or more objects. In statistics, the word
correlation refers to the relationship between two variables. Correlation exists between two
variables when one of them is related to the other in some way.

Q.15 Define positive and negative correlation.


Ans. :
Positive correlation : An association between variables such that high scores on one variable
tend to occur with high scores on the other variable; a direct relation between the variables.
Negative correlation : An association between variables such that high scores on one variable
tend to occur with low scores on the other variable; an inverse relation between the variables.
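As a sketch with hypothetical data, both kinds of correlation can be checked with NumPy's correlation coefficient :

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5])
pos = np.array([2, 4, 6, 8, 10])  # rises with x : positive correlation
neg = np.array([10, 8, 6, 4, 2])  # falls as x rises : negative correlation

r_pos = np.corrcoef(x, pos)[0, 1]
r_neg = np.corrcoef(x, neg)[0, 1]
print(round(r_pos, 2), round(r_neg, 2))  # 1.0 -1.0
```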

Q.16 What is cause and effect relationship ?


Ans. : If two variables vary in such a way that movements in one are accompanied by
movements in the other, the variables are said to have a cause and effect relationship.

Q.17 Explain advantages of scatter diagram.


Ans. :
1. It is a simple and attractive method to find out the nature of
correlation.
2. It is easy to understand.
3. The user gets a rough idea about the correlation (positive or negative).


4. It is not influenced by the size of extreme items.
5. It is the first step in investigating the relationship between two variables.

Q.18 What is regression problem ?


Ans.: For an input x, if the output is continuous, this is called a regression problem.

Q.19 What are assumptions of regression ?


Ans.: The regression has five key assumptions : linear relationship, multivariate normality, no or
little multicollinearity, no auto-correlation and homoscedasticity (constant variance of the residuals).

Q.20 What is regression analysis used for ?


Ans.: Regression analysis is a form of predictive modelling technique which investigates the
relationship between a dependent (target) variable and independent (predictor) variable(s). This
technique is used for forecasting, time series modelling and finding the cause-and-effect
relationship between the variables.

Q.21 What are the types of regressions ?


Ans.: Types of regression are linear regression, logistic regression, polynomial regression,
stepwise regression, ridge regression, lasso regression and elastic-net regression.

Q.22 What do you mean by least square method ?


Ans. : Least squares is a statistical method used to determine a line of best fit by minimizing the
sum of squares created by a mathematical function. A "square" is determined by squaring the
distance between a data point and the regression line or mean value of the data set.
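As an illustrative sketch (hypothetical points), NumPy's polyfit finds the least-squares line of best fit :

```python
import numpy as np

# Hypothetical points lying exactly on the line y = 2x + 1.
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.0, 3.0, 5.0, 7.0])

# polyfit(deg=1) minimizes the sum of squared vertical distances
# between the points and the fitted line.
slope, intercept = np.polyfit(x, y, deg=1)
print(round(slope, 2), round(intercept, 2))  # 2.0 1.0
```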

Q.23 What is correlation analysis ?


Ans. : Correlation is a statistical analysis used to measure and describe the relationship between
two variables. A correlation plot displays correlations between the values of variables in
the dataset. If two variables X and Y are correlated, then a regression can be done in order to
predict scores on Y from the scores on X.

Q.24 What is multiple regression equations ?


Ans. : Multiple linear regression is an extension of linear regression, which allows a response
variable, y to be modelled as a linear function of two or more predictor variables. In a
multiple regression model, two or more independent variables, i.e. predictors are involved in
the model.
The simple linear regression model and the multiple regression model assume that the dependent
variable is continuous.


UNIT III - INFERENTIAL STATISTICS

Q.1 Define population.


Ans.: Population is a collection of objects. It may be finite or infinite according to the number
of objects in the population.

Q.2 What is sample ?


Ans. : A sample is a group of units selected from a larger group (the population). By studying
the sample it is hoped to draw valid conclusions about the larger group. A sample is a subset
of a population. Sample is a smaller group, the part of the population of interest that we
actually examine in order to gather the information.

Q.3 What is sampling distribution ?


Ans. : A sampling distribution is the probability distribution of a statistic obtained from a large
number of samples drawn from a specific population. The sampling distribution of a given
population is the distribution of frequencies of the range of different outcomes that could
possibly occur for a statistic of that population.

Q.4 What is use of standard error of the mean?


Ans. : The standard error of the mean (SEM) is used to measure the differences between
sample means drawn from the same population. It helps to estimate how well a sample represents the
whole population by measuring, via the standard deviation, the accuracy with which the sample
mean estimates the population mean.
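A minimal sketch with a hypothetical sample : SEM is the sample standard deviation divided by the square root of the sample size.

```python
import numpy as np

sample = np.array([4.0, 8.0, 6.0, 5.0, 7.0])  # hypothetical sample

# SEM = sample standard deviation / sqrt(n); ddof=1 gives the
# unbiased sample standard deviation.
sem = np.std(sample, ddof=1) / np.sqrt(len(sample))
print(round(sem, 3))  # 0.707
```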

Q.5 What is inferential statistics ?


Ans. : The null and alternate hypothesis statements are important parts of the analytical
methods collectively known as inferential statistics. Inferential statistics are methods used to
determine something about a population, based on the observation of a sample.

Q.6 What is sampling error ?


Ans. : The difference between the point estimate and the actual population parameter value is
called the sampling error.

Q.7 What is point estimate ?


Ans. : A point estimate is a single value estimate for a population parameter. The most
unbiased point estimate of the population mean is the sample mean. A point estimate is a
single numerical value used to estimate the corresponding population parameter.


UNIT IV - ANALYSIS OF VARIANCE

Q.1 What is one sided test ?


Ans. : A one-sided test is a statistical hypothesis test in which the values for which we can
reject the null hypothesis, H0, are located entirely in one tail of the probability distribution.

Q.2 What is p-value ?


Ans.: • The p-value is the probability of observing a value as extreme as or more extreme than the observed value.
• The p-value is the probability of observing a value of the test statistic as extreme as the one
observed, if the null hypothesis is true. So a small p-value indicates that the null hypothesis is
not true and hence should be rejected.

Q.3 Define estimator.


Ans. : The procedure or rule to determine an unknown population parameter is called an
estimator.

Q.4 When type II error occurs ?


Ans. : A type II error occurs when the sample does not appear to have been affected by the
treatment when, in fact, the treatment does have an effect. In this case, the researcher will fail
to reject the null hypothesis and falsely conclude that the treatment does not have an effect.

Q.5 What is two-sided test ?


Ans. : A hypothesis test which is designed to identify a difference from a hypothesized value
in either direction is called a two-sided test. A two-sided test is a statistical hypothesis test in
which the values for which we can reject the null hypothesis, H0, are located in both tails of
the probability distribution.

Q.6 What is difference between estimator and parameter ?


Ans. : A parameter is a numerical characteristic of a population (for example, the population
mean), whereas an estimator is the rule or statistic computed from sample data that is used to
estimate that parameter.

Q.7 What is goodness-of-fit test ?


Ans. : A goodness-of-fit test is an inferential procedure used to determine whether a
frequency distribution follows a claimed distribution.

Q.8 List the strength of Chi-Square Test.


Ans.: Strength :
1. It is easier to compute than some statistics.
2. Chi-square makes no assumptions about the distribution of the population.
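As a sketch of the related goodness-of-fit use (hypothetical die-roll counts), SciPy's chisquare compares observed frequencies against a claimed distribution :

```python
import numpy as np
from scipy.stats import chisquare

# Hypothetical observed counts from 100 rolls of a die,
# tested against a fair-die expectation.
observed = np.array([18, 22, 16, 14, 12, 18])
expected = np.full(6, 100 / 6)

stat, p = chisquare(observed, expected)
print(round(stat, 2), p > 0.05)  # 3.68 True
```

Here the large p-value means the observed counts are consistent with a fair die.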


UNIT V - PREDICTIVE ANALYTICS

Q.1 What is logistic regression ?


Ans. : Logistic regression is a form of regression analysis in which the outcome variable is binary
or dichotomous. It is a statistical method used to model dichotomous or binary outcomes using
predictor variables. Logistic regression is one of the supervised learning algorithms.
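A minimal sketch with hypothetical data (hours studied vs. pass/fail), using scikit-learn :

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical data : hours studied vs. pass (1) / fail (0).
X = np.array([[1], [2], [3], [4], [5], [6]])
y = np.array([0, 0, 0, 1, 1, 1])

# Fit the binary-outcome model and predict for two new students.
model = LogisticRegression().fit(X, y)
preds = model.predict([[1.5], [5.5]])
print(preds)  # [0 1]
```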

Q.2 What is omnibus test ?


Ans. : The omnibus test is a likelihood-ratio chi-square test of the current model versus the null
model. The significance value of less than 0.05 indicates that the current model outperforms
the null model. Omnibus tests are generic statistical tests used for checking whether the
variance explained by the model is more than the unexplained variance.

Q.3 Define serial correlation.


Ans. : Serial correlation is the relationship between a given variable and a lagged version of itself
over various time intervals. It measures the relationship between a variable's current value
given its past values.

Q.4 What are the consequences of serial correlation ?


Ans.: 1. Pure serial correlation does not cause bias in the regression coefficient estimates.
2. Serial correlation causes OLS to no longer be a minimum variance estimator.
3. Serial correlation causes the estimated variances of the regression coefficients to be
biased, leading to unreliable hypothesis testing.

Q.5 Define autocorrelation.


Ans. : Autocorrelation refers to the degree of correlation of the same variables between two
successive time intervals. It measures how the lagged version of the value of a variable is
related to the original version of it in a time series.
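As a sketch with a hypothetical series, pandas computes the lag-1 autocorrelation directly :

```python
import pandas as pd

# Hypothetical time series with a perfect upward trend.
s = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])

# Lag-1 autocorrelation : correlation of the series with itself
# shifted by one time step.
print(round(s.autocorr(lag=1), 2))  # 1.0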

Q.6 What are reasons for censoring ?


Ans.: There are generally three reasons why censoring might occur :
• A subject does not experience the event before the study ends.
• A person is lost to follow-up during the study period.
• A person withdraws from the study.

Q.7 Explain regression using statsmodels.


Ans. : The statsmodels linear regression model helps us to predict, and is used for fitting
scenarios where one parameter is directly dependent on another. Here, we have
one variable that is dependent and another that is independent. Depending on the
change in the value of the independent parameter, we predict the change in the
dependent variable.

Q.8 Why residual analysis is important ?


Ans. : Residual (error) analysis is important to check whether the assumptions of regression
models have been satisfied. It is performed to check the following :
1. The residuals are normally distributed.
2. The variance of residual is constant (homoscedasticity).
3. The functional form of regression is correctly specified.
4. If there are any outliers.


Q.9 What is spurious regression ?


Ans.: The regression is spurious when we regress one random walk onto another independent
random walk. It is spurious because the regression will most likely indicate a non-existing
relationship.
