2 MARK MATERIAL
Fundamentals of Data Science and Analytics (Anna University)
SRM TRP Engineering College Department of Artificial Intelligence and Data Science AD3491 - FUNDAMENTALS OF DATA SCIENCE AND ANALYTICS
UNIT I - INTRODUCTION TO DATA SCIENCE
Q.1 What is data science ?
Ans. : Data science is an interdisciplinary field that seeks to extract knowledge or insights from various forms of data. • At its core, data science aims to discover and extract actionable knowledge from data that can be used to make sound business decisions and predictions. • Data science uses advanced analytical theory and various methods, such as time series analysis, for predicting the future.
Q.2 Define structured data.
Ans. : Structured data is arranged in a row and column format. This makes it easy for applications to retrieve and process the data. A database management system is used for storing structured data. The term structured data refers to data that is identifiable because it is organized in a structure.
Q.3 What is a data set ?
Ans. : A data set is a collection of related records or information. The information may be on some entity or some subject area.
Q.4 What is unstructured data ?
Ans. : Unstructured data is data that does not follow a specified format. Rows and columns are not used for unstructured data, so it is difficult to retrieve required information. Unstructured data has no identifiable structure.
Q.5 What is machine-generated data ?
Ans. : Machine-generated data is information that is created without human interaction, as a result of a computer process or application activity. This means that data entered manually by an end-user is not considered machine-generated.
Q.6 Define streaming data.
Ans. : Streaming data is data that is generated continuously by thousands of data sources, which typically send in the data records simultaneously and in small sizes (order of Kilobytes).
Q.7 List the stages of data science process.
Ans. : Stages of data science process are as follows : • Discovery or setting the research goal • Retrieving data • Data preparation • Data exploration • Data modeling • Presentation and automation.
Q.8 What are the advantages of data repositories ?
Ans.: Advantages are as follows : • Data is preserved and archived. • Data isolation allows for easier and faster data reporting.
• Database administrators have easier time tracking problems.
• There is value to storing and analyzing data.
Q.9 What is data cleaning ?
Ans. : Data cleaning means removing inconsistent data or noise from a collection of interrelated data while retaining the necessary information.
Q.10 What is outlier detection ?
Ans.: Outlier detection is the process of detecting and subsequently excluding outliers from a given set of data. The easiest way to find outliers is to use a plot or a table with the minimum and maximum values.
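The min/max screening described above can be sketched in Python. The data values and the 1.5 × IQR refinement shown here are illustrative additions, not from the source:

```python
# A minimal sketch of min/max-based outlier screening (illustrative values).
data = [12, 15, 14, 13, 250, 16, 15, 14]

lo, hi = min(data), max(data)
print("min:", lo, "max:", hi)  # the extreme maximum (250) stands out

# A common refinement: flag points beyond 1.5 * IQR from the quartiles.
def iqr_outliers(values):
    s = sorted(values)
    n = len(s)
    q1 = s[n // 4]          # rough quartile positions for a small sample
    q3 = s[(3 * n) // 4]
    iqr = q3 - q1
    return [v for v in values if v < q1 - 1.5 * iqr or v > q3 + 1.5 * iqr]

print(iqr_outliers(data))   # only the extreme value is flagged
```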
Q.11 Explain exploratory data analysis.
Ans.: Exploratory Data Analysis (EDA) is a general approach to exploring datasets by means of simple summary statistics and graphic visualizations in order to gain a deeper understanding of data. EDA is used by data scientists to analyze and investigate data sets and summarize their main characteristics, often employing data visualization methods.
Q.14 What is data repository ?
Ans. : Data repository is also known as a data library or data archive. This is a general term to refer to a data set isolated to be mined for data reporting and analysis. The data repository is a large database infrastructure, several databases that collect, manage and store data sets for data analysis, sharing and reporting.
Q.15 List the data cleaning tasks ?
Ans. : Data cleaning tasks are as follows : 1. Data acquisition and metadata 2. Fill in missing values 3. Unified date format 4. Converting nominal to numeric 5. Identify outliers and smooth out noisy data 6. Correct inconsistent data.
Q.16 What is Euclidean distance ?
Ans. : Euclidean distance is used to measure the similarity between observations. It is calculated as the square root of the sum of squared differences between the corresponding values of two points.
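The definition above translates directly into a short Python function (the sample points are illustrative):

```python
import math

# Euclidean distance: square root of the sum of squared coordinate
# differences between two points of equal dimension.
def euclidean(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

print(euclidean((0, 0), (3, 4)))  # the classic 3-4-5 triangle gives 5.0
```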
UNIT II - DESCRIPTIVE ANALYTICS
Q.1 Define qualitative data.
Ans. : Qualitative data provides information about the quality of an object or information which cannot be measured. Qualitative data cannot be expressed as a number. Data that represent nominal scales such as gender, economic status and religious preference are usually considered to be qualitative data. It is also called categorical data.
Q.2 What is quantitative data ?
Ans.: Quantitative data is the one that focuses on numbers and mathematical calculations and can be calculated and computed. Quantitative data are anything that can be expressed as a number or quantified. Examples of quantitative data are scores on achievement tests, number of hours of study or weight of a subject.
Q.3 What is nominal data ?
Ans. : Nominal data is the first level of measurement scale, in which the numbers serve as "tags" or "labels" to classify or identify the objects. Nominal data is a type of qualitative data. It usually deals with non-numeric variables or with numbers that do not carry any quantitative value. While developing statistical models, nominal data are usually transformed before building the model.
Q.4 Describe ordinal data.
Ans.: Ordinal data is a variable in which the value of the data is captured from an ordered set, which is recorded in the order of magnitude. Ordinal represents the "order." Ordinal data is known as qualitative data or categorical data. It can be grouped, named and also ranked.
Q.5 What is an interval data ?
Ans. : Interval data corresponds to a variable in which the value is chosen from an interval set. It is defined as a quantitative measurement scale in which the difference between two values is meaningful. In other words, the variables are measured on an exact scale, but the zero point is arbitrary (for example, temperature in Celsius), so ratios of values are not meaningful.
Q.6 What is frequency distribution ?
Ans. : Frequency distribution is a representation, either in a graphical or tabular format that displays the number of observations within a given interval. The interval size depends on the data being analyzed and the goals of the analyst.
Q.7 What is cumulative frequency ?
Ans. : A cumulative frequency distribution can be useful for ordered data (e.g. data arranged in intervals, measurement data, etc.). Instead of reporting frequencies, the recorded values are the sum of all frequencies for values less than and including the current value.
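The two definitions above (frequency distribution and cumulative frequency) can be illustrated in Python with a small made-up set of scores; the running total for each value sums the frequencies of all values up to and including it:

```python
from collections import Counter

# Frequency distribution of a small set of scores (illustrative data),
# then the cumulative frequency for values <= each score.
scores = [3, 1, 2, 3, 2, 3, 1, 2, 2]

freq = Counter(scores)      # frequency of each distinct value
running = 0
for value in sorted(freq):
    running += freq[value]  # cumulative frequency up to this value
    print(value, freq[value], running)
```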
Q.8 Explain histogram.
Ans. : A histogram is a special kind of bar graph that applies to quantitative data (discrete or continuous). The horizontal axis represents the range of data values. The bar height represents the frequency of data values falling within the interval formed by the width of the bar. The bars are also pushed together with no spaces between them.
Q.9 What is goal of variability ?
Ans. : The goal for variability is to obtain a measure of how spread out the scores are in a distribution. A measure of variability usually accompanies a measure of central tendency as basic descriptive statistics for a set of scores.
Q.10 How to calculate range ?
Ans.: The range is the total distance covered by the distribution, from the highest score to the lowest score (using the upper and lower real limits of the range). Range = Maximum value - Minimum value
Q.11 What is an independent variable ?
Ans. : An independent variable is the variable that is changed or controlled in a scientific experiment to test the effects on the dependent variable.
Q.12 Explain frequency polygon.
Ans. : Frequency polygons are a graphical device for understanding the shapes of distributions. They serve the same purpose as histograms, but are especially helpful for comparing sets of data. Frequency polygons are also a good choice for displaying cumulative frequency distributions.
Q.13 What is stem and leaf diagram ?
Ans. : Stem and leaf diagrams allow raw data to be displayed visually. Each raw score is split into a leaf (typically the last, ones digit) and a stem (the remaining digits).
Q.14 What is correlation ?
Ans.: Correlation refers to a relationship between two or more objects. In statistics, the word correlation refers to the relationship between two variables. Correlation exists between two variables when one of them is related to the other in some way.
Q.15 Define positive and negative correlation.
Ans. : Positive correlation: Association between variables such that high scores on one variable tends to have high scores on the other variable. A direct relation between the variables. Negative correlation: Association between variables such that high scores on one variable tends to have low scores on the other variable. An inverse relation between the variables.
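A minimal Python sketch of both cases, using the Pearson correlation coefficient on made-up study data (positive when high scores pair with high scores, negative when they pair with low ones):

```python
import math

# Pearson correlation coefficient: covariance of x and y divided by the
# product of their standard deviations (computed here from sums of squares).
def pearson_r(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

hours = [1, 2, 3, 4, 5]
marks = [40, 50, 60, 70, 80]          # rises with hours: positive correlation
print(pearson_r(hours, marks))        # close to +1 (direct relation)
print(pearson_r(hours, marks[::-1]))  # close to -1 (inverse relation)
```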
Q.16 What is cause and effect relationship ?
Ans. : If two variables vary in such a way that movements in one are accompanied by movements in the other, these variables are said to have a cause and effect relationship.
Q.17 Explain advantages of scatter diagram.
Ans. : 1. It is a simple and attractive method to find out the nature of correlation. 2. It is easy to understand. 3. The user gets a rough idea about the correlation (positive or negative).
4. It is not influenced by the size of extreme items.
5. It is the first step in investigating the relationship between two variables.
Q.18 What is regression problem ?
Ans.: For an input x, if the output is continuous, this is called a regression problem.
Q.19 What are assumptions of regression ?
Ans. : Linear regression has five key assumptions : linear relationship, multivariate normality, no or little multicollinearity, no autocorrelation and homoscedasticity.
Q.20 What is regression analysis used for ?
Ans. : Regression analysis is a form of predictive modelling technique which investigates the relationship between a dependent (target) variable and independent (predictor) variable(s). This technique is used for forecasting, time series modelling and finding the cause-effect relationship between the variables.
Q.21 What are the types of regressions ?
Ans.: Types of regression are linear regression, logistic regression, polynomial regression, stepwise regression, ridge regression, lasso regression and elastic-net regression.
Q.22 What do you mean by least square method ?
Ans. : Least squares is a statistical method used to determine a line of best fit by minimizing the sum of squares created by a mathematical function. A "square" is determined by squaring the distance between a data point and the regression line or mean value of the data set.
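The least squares line of best fit for one predictor has a closed-form solution, which can be sketched in Python (the data below is made up so that the line fits exactly):

```python
# Least-squares fit of the line y = a + b*x:
# b = sum((x - x_mean)(y - y_mean)) / sum((x - x_mean)^2),  a = y_mean - b*x_mean.
# These choices minimize the sum of squared vertical distances to the line.
def fit_line(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    b = sum((a - mx) * (c - my) for a, c in zip(x, y)) / \
        sum((a - mx) ** 2 for a in x)
    return my - b * mx, b   # (intercept, slope)

x = [1, 2, 3, 4]
y = [3, 5, 7, 9]            # lies exactly on y = 1 + 2x
intercept, slope = fit_line(x, y)
print(intercept, slope)     # recovers 1.0 and 2.0
```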
Q.23 What is correlation analysis ?
Ans. : Correlation is a statistical analysis used to measure and describe the relationship between two variables. A correlation plot displays the correlations between the values of variables in the dataset. If two variables X and Y are correlated, then a regression can be done in order to predict scores on Y from the scores on X.
Q.24 What is multiple regression equations ?
Ans. : Multiple linear regression is an extension of linear regression, which allows a response variable, y to be modelled as a linear function of two or more predictor variables. In a multiple regression model, two or more independent variables, i.e. predictors are involved in the model. The simple linear regression model and the multiple regression model assume that the dependent variable is continuous.
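A short sketch of a multiple regression fit with NumPy's least-squares solver; the synthetic data is constructed to satisfy y = 1 + 2·x1 + 3·x2 exactly, so the solver recovers those coefficients:

```python
import numpy as np

# Multiple linear regression y = b0 + b1*x1 + b2*x2 solved by least squares.
# The first column of ones corresponds to the intercept term b0.
X = np.array([[1, 1, 1],
              [1, 2, 1],
              [1, 3, 2],
              [1, 4, 2],
              [1, 5, 3]], dtype=float)
y = np.array([6, 8, 13, 15, 20], dtype=float)  # exactly 1 + 2*x1 + 3*x2

coeffs, *_ = np.linalg.lstsq(X, y, rcond=None)
print(coeffs)   # recovers [1, 2, 3] up to floating-point error
```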
UNIT III - INFERENTIAL STATISTICS
Q.1 Define population.
Ans.: Population is a collection of objects. It may be finite or infinite according to the number of objects in the population.
Q.2 What is sample ?
Ans. : A sample is a group of units selected from a larger group (the population). By studying the sample it is hoped to draw valid conclusions about the larger group. A sample is a subset of a population. Sample is a smaller group, the part of the population of interest that we actually examine in order to gather the information.
Q.3 What is sampling distribution ?
Ans. : A sampling distribution is a probability distribution of a statistic obtained from a larger number of samples drawn from a specific population. The sampling distribution of a given population is the distribution of frequencies of a range of different outcomes that could possibly occur for a statistic of a population.
Q.4 What is use of standard error of the mean?
Ans. : The standard error of the mean (SEM) measures how much sample means are expected to vary from one sample to another. It helps to estimate how well a sample represents the whole population, and is computed from the standard deviation and the sample size.
Q.5 What is inferential statistics ?
Ans. : The null and alternate hypothesis statements are important parts of the analytical methods collectively known as inferential statistics. Inferential statistics are methods used to determine something about a population, based on the observation of a sample.
Q.6 What is sampling error ?
Ans. : The difference between the point estimate and the actual population parameter value is called the sampling error.
Q.7 What is point estimate ?
Ans. : A point estimate is a single value estimate for a population parameter. The most unbiased point estimate of the population mean is the sample mean. A point estimate is a single numerical value used to estimate the corresponding population parameter.
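Q.6 and Q.7 above can be tied together in a short Python sketch: the sample mean is the point estimate, and its difference from the known population mean is the sampling error (the population and sample here are made up for illustration):

```python
import random
import statistics

# Point estimate and sampling error: the sample mean estimates the
# population mean; their difference is the sampling error.
random.seed(42)                          # fixed seed for reproducibility
population = list(range(1, 101))         # known population, mean = 50.5
sample = random.sample(population, 20)   # a simple random sample

point_estimate = statistics.mean(sample)
sampling_error = point_estimate - statistics.mean(population)
print("point estimate:", point_estimate)
print("sampling error:", sampling_error)
```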
UNIT IV - ANALYSIS OF VARIANCE
Q.1 What is one sided test ?
Ans. : A one-sided test is a statistical hypothesis test in which the values for which we can reject the null hypothesis, H0, are located entirely in one tail of the probability distribution.
Q.2 What is p-value ?
Ans. : The p-value is the probability of observing a value of the test statistic as extreme as, or more extreme than, the one observed, if the null hypothesis is true. A small p-value therefore indicates that the null hypothesis is unlikely to be true and hence should be rejected.
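As a sketch, assuming a z-test setting where the test statistic follows a standard normal under the null, the two-sided p-value can be computed with the normal CDF built from `math.erf`:

```python
import math

# Two-sided p-value for an observed z statistic under a standard normal
# null distribution: p = 2 * P(Z >= |z|), where the normal CDF is
# Phi(z) = (1 + erf(z / sqrt(2))) / 2.
def two_sided_p(z):
    phi = 0.5 * (1.0 + math.erf(abs(z) / math.sqrt(2.0)))
    return 2.0 * (1.0 - phi)

print(round(two_sided_p(1.96), 3))  # near the conventional 0.05 threshold
print(round(two_sided_p(3.00), 4))  # more extreme z gives a smaller p
```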
Q.3 Define estimator.
Ans. : The procedure or rule to determine an unknown population parameter is called an estimator.
Q.4 When type II error occurs ?
Ans. : A type II error occurs when the sample does not appear to have been affected by the treatment when, in fact, the treatment does have an effect. In this case, the researcher will fail to reject the null hypothesis and falsely conclude that the treatment does not have an effect.
Q.5 What is two-sided test ?
Ans. : A hypothesis test which is designed to identify a difference from a hypothesized value in either direction is called a two-sided test. A two-sided test is a statistical hypothesis test in which the values for which we can reject the null hypothesis, H0, are located in both tails of the probability distribution.
Q.6 What is difference between estimator and parameter ?
Ans. : An estimator is a rule or formula, computed from sample data, that is used to estimate an unknown population parameter. A parameter is a fixed numerical characteristic of the population itself; it is usually unknown, and the estimator is used to approximate it.
Q.7 What is goodness-of-fit test ?
Ans. : A goodness-of-fit test is an inferential procedure used to determine whether a frequency distribution follows a claimed distribution.
Q.8 List the strength of Chi-Square Test.
Ans.: Strength : 1. It is easier to compute than some statistics. 2. Chi-square makes no assumptions about the distribution of the population.
UNIT V - PREDICTIVE ANALYTICS
Q.1 What is logistic regression ?
Ans. : Logistic regression is a form of regression analysis in which the outcome variable is binary or dichotomous. A statistical method used to model dichotomous or binary outcomes using predictor variables. Logistic regression is one of the supervised learning algorithms.
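The core mechanism can be sketched in Python: the linear predictor is passed through the logistic (sigmoid) function, mapping it to a probability for the binary outcome. The coefficients below are made up for illustration, not fitted values:

```python
import math

# Logistic regression prediction: probability of the positive class is
# sigmoid(intercept + slope * x) = 1 / (1 + exp(-(intercept + slope * x))).
def predict_prob(x, intercept=-3.0, slope=1.5):
    return 1.0 / (1.0 + math.exp(-(intercept + slope * x)))

print(predict_prob(2.0))   # linear predictor 0, so probability 0.5
print(predict_prob(4.0))   # linear predictor 3, probability near 1
```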
Q.2 What is omnibus test ?
Ans. : The omnibus test is a likelihood-ratio chi-square test of the current model versus the null model. The significance value of less than 0.05 indicates that the current model outperforms the null model. Omnibus tests are generic statistical tests used for checking whether the variance explained by the model is more than the unexplained variance.
Q.3 Define serial correlation.
Ans. : Serial correlation is the relationship between a given variable and a lagged version of itself over various time intervals. It measures the relationship between a variable's current value given its past values.
Q.4 What are the consequences of serial correlation ?
Ans. : 1. Pure serial correlation does not cause bias in the regression coefficient estimates. 2. Serial correlation causes OLS to no longer be a minimum variance estimator. 3. Serial correlation causes the estimated variances of the regression coefficients to be biased, leading to unreliable hypothesis testing.
Q.5 Define autocorrelation.
Ans. : Autocorrelation refers to the degree of correlation of the same variables between two successive time intervals. It measures how the lagged version of the value of a variable is related to the original version of it in a time series.
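A minimal sketch of lag-k autocorrelation in Python, on two made-up series: a steady trend (strongly positive at lag 1) and an alternating series (negative at lag 1):

```python
# Lag-k autocorrelation: covariance between the series and a copy of
# itself shifted by `lag` steps, divided by the series variance.
def autocorr(series, lag=1):
    n = len(series)
    mean = sum(series) / n
    var = sum((v - mean) ** 2 for v in series)
    cov = sum((series[t] - mean) * (series[t - lag] - mean)
              for t in range(lag, n))
    return cov / var

trend = [1, 2, 3, 4, 5, 6, 7, 8]     # steadily rising: positive lag-1
alternating = [1, -1, 1, -1, 1, -1]  # flips each step: negative lag-1
print(autocorr(trend, lag=1))
print(autocorr(alternating, lag=1))
```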
Q.6 What are reasons for censoring ?
Ans.: There are generally three reasons why censoring might occur : • A subject does not experience the event before the study ends. • A person is lost to follow-up during the study period. • A person withdraws from the study.
Q.7 Explain regression using statsmodels.
Ans. : The statsmodels library provides linear regression models for fitting scenarios where one variable depends directly on another. There is one dependent variable and one or more independent variables; given a change in the value of an independent variable, the fitted model predicts the corresponding change in the dependent variable.
Q.8 Why residual analysis is important ?
Ans. : Residual (error) analysis is important to check whether the assumptions of regression models have been satisfied. It is performed to check the following : 1. The residuals are normally distributed. 2. The variance of residual is constant (homoscedasticity). 3. The functional form of regression is correctly specified. 4. If there are any outliers.
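A first step of the checks above can be sketched in Python: residuals are observed minus predicted values, and for a well-specified least-squares fit their mean is essentially zero. The data and the fitted line y = 1 + 2x below are hypothetical:

```python
# Residuals = observed - predicted, for a hypothetical fitted line y = 1 + 2x.
x = [1, 2, 3, 4, 5]
observed = [3.1, 4.9, 7.2, 8.8, 11.0]
predicted = [1 + 2 * xi for xi in x]

residuals = [o - p for o, p in zip(observed, predicted)]
print(residuals)                      # should hover around zero, no pattern

mean_resid = sum(residuals) / len(residuals)
print(round(mean_resid, 3))           # near zero for a well-specified model
```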
Q.9 What is spurious regression ?
Ans.: The regression is spurious when we regress one random walk onto another independent random walk. It is spurious because the regression will most likely indicate a non-existing relationship.