Data Science Using R

Uploaded by

yanele6282


DATA SCIENCE

PART-A
SHORT QUESTIONS WITH SOLUTIONS

Q1. Write in short about data science.
Answer : (Model Paper, Q1)
Data science is the combination of several tools, algorithms and machine learning principles whose aim is to discover the hidden patterns in raw data. For example, consider the role of statisticians.
[Figure: the roles of the data analyst and the data scientist; labels include business administration and exploratory data analysis]
In the above figure, the role of the data analyst is to explain what is going on by processing the history of the data. The data scientist is responsible for exploratory analysis to discover insights and for identifying the reasons behind future events by using advanced machine learning algorithms. This is done by observing the data from various angles. Hence, data science is used to make decisions and predictions through predictive causal analytics, prescriptive analytics and machine learning.

Q2. What is linear algebra?
Answer : (Model Paper, Q2)
Linear algebra is a branch of mathematics concerned with linear equations and linear functions, along with their representations through matrices and vector spaces. It is used in almost all areas of mathematics, such as geometry and functional analysis. Its concepts must be understood before it can be applied. In data science, linear algebra is used in the form of scalars, vectors, matrices and tensors.
1. Scalar : It is a single number. Example : 1
2. Vector : It is an array of numbers.
3. Matrix : It is a 2-D array.
4. Tensor : It is an n-dimensional array with n > 2.
Various operations that can be performed on them are transposition, vector and matrix multiplication, identity and inverse matrices, etc.

Q3. Define linear equation.
Answer : (Model Paper)
Linear Equation
An equation in the variable x that can be written in one of the following forms is called a linear equation.
y = mx + b
Ax + By = C
Here, m, b, A, B and C are real numbers. A and B must not both be zero.
The graph of any linear equation is a straight line.
Let the linear equation be Ax = b, where,
A is an m × n matrix of coefficients for m equations and n unknowns.
x is an n × 1 vector of unknowns x1, x2, ..., xn.
b is an m × 1 vector of constants, i.e., the right hand sides of the equations.

Q4. Define distance.
Answer :
Distance
distance is a function that calculates various dissimilarity or distance metrics. It allows new metrics to be added. It is not meant for large matrices.
Syntax
    distance(x, method = "euclidean", sprange = NULL, spweight = NULL, icov)
A distance matrix can also be computed and returned through a specified distance measure that computes the distances between the rows of a data matrix.

Q5. Explain in brief about hyper planes.
Answer : (Model Paper)
Hyper Planes
A hyper plane is a geometric entity whose dimension is one less than that of its ambient space. For example, for a 3D space the hyper plane is a 2D plane, for a 2D space the hyper plane is a 1D line, and so on. A hyper plane can be defined by the below equation,
X^T n + b = 0
The above equation can be expanded for n dimensions,
X1 n1 + X2 n2 + X3 n3 + ... + Xn nn + b = 0
For 2 dimensions the equation is,
X1 n1 + X2 n2 + b = 0
Consider the hyper plane of the below form,
X^T n = 0
i.e., if the plane goes through the origin, the hyper plane also becomes a subspace.
The function hyperplane computes a (k - 1)-dimensional hyper plane that passes through k given points in k-dimensional space. The general format of this function is hyperplane(X).

Q6. What is half space?
Answer :
Half Space
The half spaces of R^n are the sets associated with a pair (s, r) ∈ R^n × R, s ≠ 0. The closed half space is defined by,
{x ∈ R^n : <s, x> ≤ r}
and the open half space is obtained by making the inequality strict. If the space is two dimensional then the half space is known as a half plane. A half space in one-dimensional space is known as a ray.
It can be defined by a linear inequality that is derived from the linear equation which in turn specifies the defining hyper plane. A linear inequality that is strict specifies an open half space,
a1 x1 + a2 x2 + ... + an xn > b

Q7. Write short notes on eigen values.
Answer :
Eigen values are the numerical values that can determine the number of features to be retained. The concept of eigen values is applicable to square matrices. It is considered a very important topic; for example, principal component analysis is based on it. Each eigen value has a corresponding eigen vector. The principal component analysis of a system of variables is performed by computing the eigen values of the dispersion matrix or correlation matrix of the variables. A principal component is the linear combination of the variables whose coefficients are the items of the corresponding eigen vector.
The eigen values define the proportion of variance accounted for by every eigen vector that is derived from the transformation of the original set of variables to orthogonal variables. This leads to a decrease in the number of variables needed to account for the majority of the total variance among the original variables. If each original variable contributes in the direction of the eigen vectors then the important variables can be summarized in a smaller number of vectors.

Q8. Define eigen vector.
Answer : (Model Paper)
Eigen Vector
In linear algebra, if T is a linear transformation from a vector space V over a field F into itself and v is a non-zero vector in V, then v is called an eigen vector of T if T(v) is a scalar multiple of v, i.e.,
T(v) = λv
where λ is a scalar in F known as the eigen value associated with the eigen vector v.
Eigen vectors are associated with linear models, which are usual in engineering as a first approximation. These vectors are not rotated by the transformation matrix. The concept is applied in machine learning algorithms that are mostly useful in handling large data sets, and it is considered the backbone of such algorithms.
For example, multiply a 2-dimensional vector by a 2 × 2
matrix,
[1 2; 0 3] [1; 2] = [5; 6]
This particular operation on a vector is called a linear transformation. The column matrix represents a vector whose input and output directions are not the same. The vectors whose direction does not change after applying the linear transformation with the matrix are called eigen vectors. This concept is applicable only to square matrices.

PART-B
ESSAY QUESTIONS WITH SOLUTIONS

1.1 INTRODUCTION TO DATA SCIENCE

Q9. Give a brief introduction on data science. Illustrate various phases of it.
Answer :
Data science is the combination of several tools, algorithms and machine learning principles whose aim is to discover the hidden patterns in raw data. For example, consider the role of statisticians.
[Figure: the roles of the data analyst and the data scientist]
The data scientist is responsible for exploratory analysis to discover insights and for identifying the reasons behind future events by using advanced machine learning algorithms. This is done by observing the data from various angles. Hence, data science is used to make decisions and predictions through predictive causal analytics, prescriptive analytics and machine learning.
Phases of Data Science
Various phases in the data science life cycle are as follows,
[Figure: phases of the data science life cycle]
Phase 1 : Discovery
The specifications, requirements, priorities and required budget must be understood before starting any project. In addition to this, the business problem must be framed and initial hypotheses must be formulated to test.
Phase 2 : Data Preparation
In this phase a sandbox is required to perform analytics for the complete time period of the project. The data must be explored, preprocessed and conditioned before modeling. Then ETLT (Extract, Transform, Load and Transform) must be performed to get the data into the sandbox.
The activities in this phase are as follows,
+ Preparation of analytics sandbox
+ Performing ETLT
+ Data conditioning
+ Survey and visualize
Here, R can be used for cleaning the data, transforming it and then visualizing it. With this the outliers can be determined and relationships between the variables can be established. After this, exploratory analytics is performed on the data.
Phase 3 : Model Planning
In this phase, the methods to draw the relationships between the variables are determined. This sets the base for the algorithms that are implemented in the next phase. EDA (Exploratory Data Analytics) is applied using various statistical formulas and visualization tools.
Some of the model planning tools are as follows,
1. SQL analysis services : It is used to perform in-database analytics through common data mining functions and basic predictive models.
2. R : It contains a complete set of modeling capabilities and provides a good environment for building interpretive models.
3. SAS/ACCESS : It is used for accessing data and for creating repeatable and reusable model flows.
Phase 4 : Model Building
In this phase, the data sets are built for training and testing purposes. In addition to this, various learning techniques such as association, classification and clustering are analyzed for building the model. Various commonly used tools for model building are as follows,
(i) SAS enterprise miner
(ii) WEKA
(iii) SPSS modeler
(iv) Matlab
(v) Alpine miner
(vi) Statistica
Phase 5 : Operationalize
In this phase, the final reports, briefings, code and technical documents are delivered.
Phase 6 : Communicate Results
In this phase, the outcome is compared with the goal set in the first phase.
The key findings are identified and communicated to the stakeholders to determine whether the project results are a success or a failure based on the criteria developed in phase 1.

1.2 LINEAR ALGEBRA FOR DATA SCIENCE

Q10. Explain about linear algebra and its role in data science.
Answer : (Model Paper)
Linear algebra is a branch of mathematics concerned with linear equations and linear functions, along with their representations through matrices and vector spaces. It is used in almost all areas of mathematics, such as geometry and functional analysis. Its concepts must be understood before it can be applied. In data science, linear algebra is used in the form of scalars, vectors, matrices and tensors.
1. Scalar : It is a single number. Example : 1
2. Vector : It is an array of numbers.
3. Matrix : It is a 2-D array.
4. Tensor : It is an n-dimensional array with n > 2.
Various operations that can be performed on them are transposition, vector and matrix multiplication, identity and inverse matrices, etc.
Linear Algebraic Operations on Vectors and Matrices
Linear algebra is a mathematical discipline that deals with vector spaces and their mappings. R programming supports linear algebraic operations like addition and multiplication on vectors and matrices.
Multiplication Operation on Vectors
The product between two vectors can be calculated in the following manner,
[Screenshot: R console, product of two vectors]
In order to compute the dot product (inner product) between two vectors, a predefined function called crossprod() is used.
Syntax
    crossprod( )
Example
[Screenshot: R console, crossprod() example]
Multiplication Operation on Matrices
In R programming, the code for this is as follows,
[Screenshot: R console, matrix multiplication]
Apart from this, the various algebra functions that are available in R programming are as follows,
+ t() : This function computes the transpose of a matrix.
+ qr() : This function finds the QR decomposition.
+ chol() : This function computes the Cholesky decomposition.
+ det() : This function calculates the determinant of a given matrix.
+ eigen() : This function computes the eigen values and eigen vectors.
+ diag() : This function computes the diagonal of a square matrix.
+ solve() : This function solves a system of linear equations.
+ sweep() : This function sweeps a summary statistic out of the rows or columns of an array.
Among these functions, diag(), solve() and sweep() are the most important predefined functions. The functionality of these functions is as follows,
1. diag() : This function computes the diagonal of a square matrix. It takes two types of arguments (either a matrix or a vector). If a matrix is taken as the argument then the resultant output is a vector, whereas if a vector is taken as the argument then the resultant output is a matrix.
[Screenshot: R console, diag() examples]
2. solve() : This function solves a system of linear equations and also calculates the inverse of a matrix.
Example
[Example equation, its matrix representation and the R console output]
When solve() is called with only the coefficient matrix, the output is the inverse of that matrix. When it is called with the coefficient matrix and the right hand side, the output is the solution of the given linear equations.
3. sweep() : This function is used for performing operations on numerical values across the rows or columns of an array.
Example
[Screenshot: R console, sweep() example]

1.3 LINEAR EQUATIONS

Q11. Discuss about linear equations.
Answer : (Model Paper)
Linear Equation
An equation in the variable x that can be written in one of the following forms is called a linear equation.
y = mx + b
Ax + By = C
Here, m, b, A, B and C are real numbers. A and B must not both be zero. The graph of any linear equation is a straight line.
Linear equations can be solved by following the below steps,
1. Initially the equation must be cleared of all fractions by multiplying both sides of the equation by the
LCD (Lowest Common Denominator) of the fractions.
2. Each side of the equation must be simplified completely by using the distributive property in order to remove the parentheses and to combine like terms.
3. Now isolate the variable terms on one side of the equation and the numbers on the other side through the addition property of equality.
4. Generate an equation with a variable whose coefficient is 1 by using the multiplication property of equality.
5. Finally check the answer in the original equation.
Let the linear equation be Ax = b, where,
+ A is an m × n matrix of coefficients for m equations and n unknowns.
+ x is an n × 1 vector of unknowns x1, x2, ..., xn.
+ b is an m × 1 vector of constants, i.e., the right hand sides of the equations.
Conditions for the solutions are as follows,
(i) The equations are consistent if r(A | b) = r(A)
    + The solution is unique if r(A | b) = r(A) = n
    + The solution is underdetermined if r(A | b) = r(A) < n
(ii) The equations are inconsistent if r(A | b) > r(A)
To demonstrate the ranks, use c(R(A), R(cbind(A, b))) and to test for consistency, use all.equal(R(A), R(cbind(A, b))).
Equations in Two Unknowns
Every equation in two unknowns corresponds to a line in 2D space. If all the lines intersect at one point then the equations have a unique solution.
Two Consistent Equations
    library(matlib)   # provides R(), showEqn(), plotEqn() and Solve()
    A <- matrix(c(1, 2, -1, 2), 2, 2)
    b <- c(2, 1)
    showEqn(A, b)
    ## 1*x1 - 1*x2 = 2
    ## 2*x1 + 2*x2 = 1
    c(R(A), R(cbind(A, b)))            # show ranks
    ## [1] 2 2
    all.equal(R(A), R(cbind(A, b)))    # consistent?
    ## [1] TRUE
Plot the Equations
Equations can be plotted as shown below.
    plotEqn(A, b)
    ## x1 - x2 = 2
    ## 2*x1 + 2*x2 = 1
The solution can be determined more comprehensibly by the Solve() function.
    Solve(A, b, fractions = TRUE)
    ## x1 = 5/4
    ## x2 = -3/4
In a similar way, three consistent equations, three inconsistent equations and equations in three unknowns can also be examined.

1.4 DISTANCE

Q12. Write in detail about distance.
Answer : (Model Paper, Q14(b))
Distance
distance is a function that calculates various dissimilarity or distance metrics. It allows new metrics to be added. It is not meant for large matrices, but is purely a choice for understandability and extensibility.
Syntax
    distance(x, method = "euclidean", sprange = NULL, spweight = NULL, icov)
The dist function can also be used to compute and return the distance matrix through a specified distance measure that computes the distances between the rows of a data matrix.
Syntax
    dist(x, method = "euclidean", diag = FALSE, upper = FALSE, p = 2)
Arguments
x : It represents a numeric matrix, data frame or 'dist' object with rows as samples and columns as variables. The distance is computed for every pair of rows.
method : It selects one of the various dissimilarity metrics such as euclidean, bray-curtis, manhattan, mahalanobis, jaccard, difference, sorensen, gower, modgower10 and modgower2.
sprange : The gower dissimilarities allow division by the species range. If its value is NULL then no range is used. If its value is a vector of length nrow(x) then it is used to standardize the dissimilarities.
diag : It is a logical value that represents whether the diagonal of the distance matrix is to be printed by print.dist.
spweight : Weighting is allowed by the euclidean, gower and manhattan dissimilarities. If its value is NULL then no weighting is used. If its value is "absence" then w = 0 when both species are absent and 1 otherwise, so that joint absences are excluded.
upper : It is a logical value that represents whether the upper triangle of the distance matrix must be displayed by print.dist.
icov : It is an optional covariance matrix that is used if method = "mahalanobis". If it is provided directly, it allows the distance to be calculated for a subset of the full dataset.
p : It indicates the power of the Minkowski distance.
Value
It returns a lower-triangular distance matrix as an object of class "dist". This object has the following attributes,
Size : It is an integer that indicates the number of observations in the dataset.
Labels : It is an optional value that consists of the labels, if any, of the observations of the dataset.
Diag, Upper : These are logical values related to the arguments diag and upper that depict how the object must be displayed.
call : It is an optional value that holds the call used to create the object.

1.5 HYPER PLANES, HALF SPACES

Q13. Explain about hyper planes.
Answer :
Hyper Planes
A hyper plane is a geometric entity whose dimension is one less than that of its ambient space. For example, for a 3D space the hyper plane is a 2D plane, for a 2D space the hyper plane is a 1D line, and so on. A hyper plane can be defined by the below equation,
X^T n + b = 0
The above equation can be expanded for n dimensions,
X1 n1 + X2 n2 + X3 n3 + ... + Xn nn + b = 0
For 2 dimensions the equation is,
X1 n1 + X2 n2 + b = 0
Consider the hyper plane of the below form,
X^T n = 0
i.e., if the plane goes through the origin, the hyper plane also becomes a subspace.
The function hyperplane computes a (k - 1)-dimensional hyper plane that passes through k given points in k-dimensional space.
The general format of this function is hyperplane(X). Here, X indicates a numeric k × k matrix with the k data points as rows.
A (k - 1)-dimensional hyper plane in R^k contains the points x that satisfy,
d^T x + c = 0
Here d is a k-vector and c is a scalar. The function returns the (k + 1)-vector (d, c). It is normalized in such a way that the length of d is equal to (k - 1)! times the (k - 1)-dimensional volume of the simplex that is formed by the points on the plane. If the value of k is 3 then the simplex is a triangle. Therefore the function can also compute volumes of simplices. Whether d points towards the origin depends on the order of the data points within the matrix X.
If the points do not define a (k - 1)-dimensional hyper plane then a vector of zeros is returned.
Example
    X <- rbind(c(4, 5), c(8, 2))
    hyperplane(X)
    X <- rbind(c(8, 2), c(4, 5))
    hyperplane(X)
    X <- diag(rep(1, 3))
    hyperplane(X)

Q14. Discuss in detail about half spaces.
Answer : (Model Paper, Q11(a))
Half Space
The half spaces of R^n are the sets associated with a pair (s, r) ∈ R^n × R, s ≠ 0, and they can be defined by,
{x ∈ R^n : <s, x> ≤ r} for the closed half space
{x ∈ R^n : <s, x> < r} for the open half space
If the space is two dimensional then the half space is known as a half plane. A half space in one-dimensional space is known as a ray.
It can be defined by a linear inequality that is derived from the linear equation which in turn specifies the defining hyper plane.
A linear inequality that is strict specifies an open half space,
a1 x1 + a2 x2 + ... + an xn > b
A linear inequality that is not strict specifies a closed half space,
a1 x1 + a2 x2 + ... + an xn ≥ b
Consider the below given two dimensional space.
[Figure: a line dividing the plane into a positive half and a negative half]
An equation in two dimensions is a line, and a line is a hyper plane in 2D. So the equation of the line can be written as,
X^T n + b = 0 for all X on the line
In two dimensions the line is,
x1 n1 + x2 n2 + b = 0
This line can be extended on both sides. If this is done, the two dimensional space is divided into two spaces. One space is on one side of the line and the other space is on the other side of the line. These two spaces are called half spaces. For example, if there are points in one half space and points in the other half space, is there any characteristic that can separate them? A solution would be to perform a certain computation for all the points in one half space and obtain some result, repeat the same procedure on the other side, and use the results to make decisions. These types of situations are mostly observed in classification problems. Consider a binary classification problem, where it is required to know in which half space a point lies. Now consider three points X1, X2 and X3 from the above figure and distinguish their positions.
In the equation X^T n + b = 0, n is said to be the normal to the line. In the above figure, the normal points to the side of n. If the equation is multiplied by -1 then the normal is defined in the opposite direction of n.
To know where the points X1, X2 and X3 lie, the expression X^T n + b must be evaluated for each of them,
X1^T n + b
X2^T n + b
X3^T n + b
For a point that lies on the line, the expression evaluates to 0.
Now consider a point X2 that is not on the line. In the above figure, take a point X' on the line and let Y' be the vector from X' to X2. From vector addition, it can be written as,
X2 = X' + Y'
This must be substituted in the expression,
X2^T n + b = (X' + Y')^T n + b = X'^T n + b + Y'^T n
Since X' is on the line, X'^T n + b = 0 and only Y'^T n remains. For a dot product a^T b = |a| |b| cos θ, where θ is the angle between the two vectors. If the angle between Y' and n is between 0° and 90° or between 270° and 360°, then cos θ is positive and Y'^T n evaluates to a positive value. Therefore for any point on this side of the line,
X^T n + b > 0
If the points are on the opposite side, i.e., the angle is between 90° and 270°, then cos θ is negative. Therefore for any point on this side of the line, or half space, the computation X^T n + b is less than 0,
X^T n + b < 0
Example
Consider a 2D geometry with n = [1; 3] and b = 4. The line X^T n + b = 0 is,
x1 + 3 x2 + 4 = 0
Consider three points (-1, -1), (1, -1) and (1, -2). Substitute these points in the above equation.
1. (-1, -1) : x1 + 3 x2 + 4 = -1 - 3 + 4 = 0
   The point (-1, -1) is on the line.
2. (1, -1) : x1 + 3 x2 + 4 = 1 - 3 + 4 = 2 > 0
   The point (1, -1) is in the positive half space.
3. (1, -2) : x1 + 3 x2 + 4 = 1 - 6 + 4 = -1 < 0
   The point (1, -2) is in the negative half space.

1.6 EIGEN VALUES, EIGEN VECTORS

Q15. Write about eigen values.
Answer : (Model Paper-III, Q14(b))
Eigen Values
Eigen values are the numerical values that can determine the number of features to be retained.
The concept of eigen values is applicable to square matrices. It is considered a very important topic; for example, principal component analysis is based on it. Each eigen value has a corresponding eigen vector. The principal component analysis of a system of variables is performed by computing the eigen values of the dispersion matrix or correlation matrix of the variables. A principal component is the linear combination of the variables whose coefficients are the items of the corresponding eigen vector.
The eigen values define the proportion of variance accounted for by every eigen vector that is derived from the transformation of the original set of variables to orthogonal variables. This leads to a decrease in the number of variables needed to account for the majority of the total variance among the original variables. If each original variable contributes in the direction of the eigen vectors then the important variables can be summarized in a smaller number of vectors.
Consider the below mathematical formula,
Ax = λx
Here, the constant λ represents the amount of stretch or shrinkage that the vector x goes through along its own direction. The solutions x are called eigen vectors and their corresponding λ are called eigen values.
For a matrix, the eigen values and eigen vectors can be computed as follows,
1. The eigen values can be computed as follows,
   Ax = λx
   Ax - λIx = 0
   (A - λI)x = 0
   Therefore the eigen values can be determined by using the below equation,
   |A - λI| = 0
2. By substituting the eigen values in the original equation, the eigen vector x can be computed.
Example
Consider the below matrix,
A = [8 7; 2 3]
|A - λI| = |8-λ 7; 2 3-λ| = 0
(8 - λ)(3 - λ) - 14 = 0
λ^2 - 11λ + 10 = 0
λ = (10, 1)
R code is as follows,
    > A <- matrix(c(8, 7, 2, 3), 2, 2, byrow = T)
    > eigen(A)
Therefore, there are two eigen values. To compute the eigen vectors, consider the below process.
If λ = 1,
[8 7; 2 3] [x1; x2] = [x1; x2]
8 x1 + 7 x2 = x1
2 x1 + 3 x2 = x2
Both equations reduce to x1 + x2 = 0. Therefore the eigen vector corresponding to λ = 1 is any non-zero multiple of [1; -1].
If λ = 10,
[8 7; 2 3] [x1; x2] = [10 x1; 10 x2]
8 x1 + 7 x2 = 10 x1
2 x1 + 3 x2 = 10 x2
Both equations reduce to 2 x1 = 7 x2. Therefore the eigen vector corresponding to λ = 10 is any non-zero multiple of [7; 2].
    > A <- matrix(c(8, 7, 2, 3), 2, 2, byrow = T)
    > ev <- eigen(A)
Relationship between Eigen Values and Eigen Vectors
The eigen values can be complex numbers even for real matrices. If the eigen values are complex then the eigen vectors are also complex.
If the matrix is symmetric, i.e.,
A = A^T
then it has the following properties,
(i) The eigen values of a symmetric matrix are always real.
(ii) The eigen vectors of a symmetric matrix are also real.
For a symmetric n × n matrix with eigen values λ1, λ2, ..., λn there are n linearly independent eigen vectors v1, v2, ..., vn.

Q16. What is an eigen vector? Explain.
Answer : (Model Paper)
Eigen Vector
In linear algebra, if T is a linear transformation from a vector space V over a field F into itself and v is a non-zero vector in V, then v is called an eigen vector of T if T(v) is a scalar multiple of v, i.e.,
T(v) = λv
where λ is a scalar in F known as the eigen value associated with the eigen vector v.
Eigen vectors are associated with linear models, which are usual in engineering as a first approximation. These vectors are not rotated by the transformation matrix. The concept is applied in machine learning algorithms that are mostly useful in handling large data sets, and it is considered the backbone of such algorithms.
For example, multiply a 2-dimensional vector by a 2 × 2 matrix,
[1 2; 0 3] [1; 2] = [5; 6]
This particular operation on a vector is called a linear transformation. The column matrix represents a vector whose input and output directions are not the same. The vectors whose direction does not change after applying the linear transformation with the matrix are called eigen vectors. This concept is applicable only to square matrices.
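The statement that an eigen vector keeps its direction under the transformation can be checked numerically. The following is only a minimal sketch in base R, assuming the 2 × 2 matrix with rows (8, 7) and (2, 3); the test vector u and all variable names are illustrative and not part of the original text.

```r
# Sketch: check that A v = lambda v holds for each eigen pair of A.
A <- matrix(c(8, 7, 2, 3), nrow = 2, byrow = TRUE)

e <- eigen(A)        # base R; values come back in decreasing order
lambda <- e$values   # eigen values
V <- e$vectors       # one unit-length eigen vector per column

# For every eigen pair, A %*% v must equal lambda * v (up to rounding).
for (i in seq_along(lambda)) {
  v <- V[, i]
  stopifnot(isTRUE(all.equal(as.vector(A %*% v), lambda[i] * v)))
}

# A vector that is not an eigen vector changes direction.
u <- c(1, 0)                  # hypothetical test vector
Au <- as.vector(A %*% u)      # image of u under A
# In 2D, two vectors are parallel exactly when this cross term is zero.
is_parallel <- isTRUE(all.equal(Au[1] * u[2] - Au[2] * u[1], 0))

print(lambda)
print(is_parallel)
```

Because eigen() scales each eigen vector to unit length, the columns of V may differ from a hand-computed eigen vector by a constant factor (including sign); the direction is what matters.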
Finding Eigen Vector of a Matrix
Consider a matrix M and an eigen vector e corresponding to the matrix. The direction of e remains unchanged when it is multiplied by the matrix; only its magnitude changes.
Consider the below equation,
Me = ce
(M - C)e = 0
In the term (M - C), C indicates an identity matrix of order equal to that of M, multiplied by a scalar c. There are two unknowns, e and c, and one equation. This equation can be solved trivially by making the vector e a zero vector. For a non-zero e there is only a single choice, that (M - C) is a singular matrix. A singular matrix has the property that its determinant is equal to 0. This property can be used to find the value of c,
Det(M - C) = 0
This produces an equation in c whose order depends on the matrix M, and this equation must be solved. If the solutions are c1, c2 and so on, then place c1 in the equation and find the vector e1 corresponding to c1. The vector e1 is an eigen vector of M. This procedure must be repeated with c2, c3 and so on.
Example
    > M <- matrix(c(30, 31, 40, 41, 50, 51, 60, 61, 70), nrow = 3, byrow = T)
    > x <- eigen(M)
    > x$values
    [1] 147.737576   5.317459  -3.055035
    > x$vectors    # 3 x 3 matrix, one eigen vector per column

Q17. Illustrate the usage of eigen vectors in data science.
Answer :
The concept of eigen vectors is applied in a machine learning algorithm, namely principal component analysis (PCA). When data has a huge set of features, it has high dimensionality. There might be redundant features in the data. These features reduce efficiency and increase the disk space used. PCA removes such features, and the eigen vectors help in defining them.
Consider the PCA algorithm for n-dimensional data that is to be reduced to k dimensions. The steps to perform this are as follows,
Step 1
Initially the data is mean normalized and feature scaled.
Step 2
The covariance matrix of the data set is computed.
To reduce the number of features (dimensions), some features must be dropped. But this leads to loss of information. So the loss of information needs to be minimized and the maximum variance maintained. For this, the directions of maximum variance must be determined. This is done in the next step.
Step 3
In this step the eigen vectors of the covariance matrix are determined. Since the data is in n dimensions, n eigen vectors corresponding to n eigen values are determined.
Step 4
Select the k eigen vectors corresponding to the k largest eigen values and then build a matrix in which every eigen vector is a column. This matrix is called U.
In order to reduce a data point a in the data set to k dimensions, the transpose of the matrix U must be determined and then multiplied with the vector a. Then the desired vector in k dimensions is obtained.

UNIT-2
STATISTICAL MODELING

PART-A
SHORT QUESTIONS WITH SOLUTIONS

Q1. Define statistical modeling.
Answer : (Model Paper, Q3)
Statistical Modeling
Statistical modeling can be defined as the formalization of the relationships between variables in the form of equations. It explains how variables are related with each other. The relationship can be in the form of mathematical equations, and a variable can be an attribute such as the height, weight or age of a person.
The variables might not be related accurately but can be stochastically related. Statistical modeling consists of analyzing data and applying it to various circumstances.
Statistical modeling introduces and illuminates the statistical reasoning that is used in modern research throughout the natural sciences as well as medicine, commerce, social sciences, government etc. It also focuses on the usage of models to untangle and quantify the variation in observed data.

Q2. What is a random variable?
Answer : (Model Paper, Q3)
Random Variables
A random variable is a variable that takes a particular value, i.e., a numerical value, with a definite probability. It is obtained from the result of a random experiment.
Random variables are denoted by capital letters and their values are denoted by the corresponding small letters.
Example : If a fair die is rolled and X denotes the number obtained, then X is called a random variable. Thus X can take any one of the values 1, 2, 3, 4, 5 or 6, each with probability 1/6. These values are tabulated as follows,
x        : 1    2    3    4    5    6
P(X = x) : 1/6  1/6  1/6  1/6  1/6  1/6
All the possible outcomes of a random experiment together are called the sample space. The sum of all the probabilities over the sample space is always 1.
Random variables are of two types, they are,
(i) Discrete random variable.
(ii) Continuous random variable.

Q3. Write in short about hypothesis testing.
Answer : (Model Paper)
Hypothesis Testing
A statistical hypothesis can be defined as an assumption about a population parameter that may or may not be true. Hypothesis testing is a set of formal procedures that is used by statisticians for accepting or rejecting a statistical hypothesis. In fact it is the process of validating the hypothesis that is made by the researcher. Ideally the complete population would be examined to validate the hypothesis. Since this is rarely possible, the process makes use of random samples from the population. The acceptance or rejection of the hypothesis depends on the result of testing over the sample data.

Q4. State the types of errors that occur in hypothesis testing.
Answer : (Model Paper)
Types of Errors
There are two types of errors that can occur in hypothesis testing.
1. Type I Error
It occurs when the null hypothesis is rejected while it is true. The probability of this error is called the significance level of the test. The significance level is denoted by the symbol α (alpha).
2. Type II Error
A type II error can be defined as the acceptance of a false null hypothesis H0. The probability of not committing a type II error is called the power of the test.
The probability of a Type II error is represented by the symbol β (beta), so the power of the test is 1 - β.

Q5. Define p-value.
Answer : Model Paper-I, Q4

p-value
The p-value can be defined as the probability of obtaining a result that is equal to or more extreme than the observation from the data when the null hypothesis is true. Hypothesis testing makes use of the p-value to weigh the strength of the evidence that the data provides about the population.

The p-value can be computed for the given data through a statistical test. It is then compared with a predetermined value, i.e., alpha. Usually the value of alpha will be 0.05. If the p-value is less than alpha then the null hypothesis is rejected, and if it is more than or equal to alpha then the null hypothesis is not rejected.

PART-B
ESSAY QUESTIONS WITH SOLUTIONS

2.1 STATISTICAL MODELING

Q6. Discuss about statistical modeling.
Answer : Model Paper, Q12(a)

Statistical Modeling
Statistical modeling can be defined as the formalization of relationships between variables in the form of equations. It is about identifying the variables and explaining how they are related to each other. The relationship can be in the form of mathematical equations, and a variable can be an attribute such as the height, weight or age of a person.

The variables might not be related exactly but can be stochastically related. Statistical modeling consists of analyzing data and applying the analysis to various circumstances.

Example
Attributes such as height and age are probabilistically distributed among humans. They are stochastically related, i.e., if a person is of age 35 then this influences the chance of this person being 4 feet tall, and if a person is of age 15 then this influences the chance of this person being 6 feet tall.

Model 1
$\text{Height}_i = b_0 + b_1\,\text{age}_i + \varepsilon_i$
Where $b_0$ is the intercept, $b_1$ is the parameter that age is multiplied by to generate a prediction of height, $\varepsilon_i$ is the error term and $i$ is the subject.

Model 2
$\text{Height}_i = b_0 + b_1\,\text{age}_i + b_2\,\text{sex}_i + \varepsilon_i$
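As a rough illustration of how Model 1 and Model 2 could be fitted in R, the sketch below uses lm() on synthetic data. The data, the sample size and the coefficient values are assumptions made up for this illustration and do not come from the text.

```r
# Synthetic data only: every numeric value below is an illustrative assumption.
set.seed(1)
n   <- 100
age <- runif(n, min = 5, max = 18)            # ages in years
sex <- rbinom(n, size = 1, prob = 0.5)        # 0/1 indicator variable
height <- 2.5 + 0.2 * age + 0.1 * sex + rnorm(n, sd = 0.2)   # height in feet

model1 <- lm(height ~ age)         # Model 1: intercept b_0 and slope b_1 for age
model2 <- lm(height ~ age + sex)   # Model 2: adds the coefficient b_2 for sex

print(coef(model1))                # estimates of b_0 and b_1
print(coef(model2))                # estimates of b_0, b_1 and b_2
```

The error term of each model corresponds to the residuals of the fit, which can be inspected with residuals(model1).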
Statistical modeling introduces and illuminates the statistical reasoning that is used in modern research throughout the natural sciences as well as medicine, e-commerce, social sciences, government etc. It also focuses on the usage of models to untangle and quantify the variation in observed data.

A template for a statistical model would be a linear regression model with independent and homoscedastic errors.

$y_i = \sum_{j=0}^{p} \beta_j x_{ij} + e_i, \quad i = 1, \dots, n$

Where the $e_i$ are NID$(0, \sigma^2)$. In matrix terms this can be written as,

$y = X\beta + e$

Where $y$ is the response vector and $X$ is the model matrix or design matrix with columns $x_0, x_1, \dots, x_p$, the determining variables. More frequently $x_0$ would be a column of ones, defining an intercept term.

Types of Statistical Modeling
Statistical modelling consists of choosing the minimal adequate model from a potentially large set of models by using stepwise model simplification. Various types of statistical models are illustrated as follows.

Model                    Interpretation
Saturated model          One parameter for each data point
                         Fit : Perfect
                         Degrees of freedom : None
                         Explanatory power of model : None
Maximal model            Consists of all (p) factors, interactions and covariates of any interest. Many of the model's terms can be insignificant.
                         Degrees of freedom : n - p - 1
                         Explanatory power of model : Depends
Minimal adequate model   Simplified model with ...

> rgamma(20, 10)
[output: 20 random values from the gamma distribution]

5. rgeom
It returns n random numbers from the geometric distribution.
rgeom(n, prob)
Here n indicates the number of observations and prob indicates the probability of success in each trial.
Example
> set.seed(2)
> rgeom(5, 1/6)

6. rlnorm
It generates random amounts with a multivariate lognormal distribution, or gives the density of this distribution at a specific point.
rlnorm.rplus(n, meanlog, varlog)
Here n indicates the number of data sets that are to be simulated, meanlog indicates the mean vector of the logs and varlog indicates the variance/covariance matrix of the logs. This multivariate form comes from the compositions package. Base R's rlnorm(n, meanlog = 0, sdlog = 1) generates univariate lognormal values, where meanlog and sdlog are the mean and standard deviation on the log scale.
Example
> log(rlnorm(50))
[output: 50 values, both positive and negative]

7. rlogis
This function gives information about the logistic distribution. It generates random deviates.
rlogis(n, location = 0, scale = 1)
Here n indicates the number of observations; location and scale have 0 and 1 as default values.
Example
> var(rlogis(1000, 0, scale = 5))

8. rmvbin
It creates correlated multivariate binary random variables by thresholding a normal distribution. It is provided by the bindata package.
rmvbin(n, margprob, bincorr)
Here n indicates the number of realizations of the variables that are to be simulated, margprob is a vector of marginal probabilities and bincorr is a matrix of binary correlations.
Example
> rmvbin(10, margprob = c(0.3, 0.9))

9. rpois
It generates values from the Poisson distribution and returns the results.
rpois(n, lambda)
Here n indicates the number of observations and lambda indicates the estimated rate of events for the distribution.

10. runif
It generates random numbers from the uniform distribution.
runif(n, min, max)
Here n indicates the number of observations; min and max are by default 0 and 1 respectively.
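A minimal sketch of how runif() might be called, once with the default range and once with an explicit range. The seed, the sample size and the interval from -1 to 1 are arbitrary choices made for this illustration.

```r
# Reproducible draws from the uniform distribution (base R, stats package).
set.seed(42)                                # arbitrary seed, only for reproducibility

u_default <- runif(5)                       # min = 0 and max = 1 by default
u_custom  <- runif(5, min = -1, max = 1)    # explicit range, chosen for illustration

print(u_default)
print(u_custom)
```

Calling set.seed() with the same value before the draws makes the generated numbers repeat exactly, which helps when comparing results.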
Example
[R console output of the generated values; digits not legible in the scan]

There are also other types of hypotheses. They are as follows.

Simple Hypothesis
A simple hypothesis is a statistical hypothesis which completely specifies an exact parameter. The null hypothesis is always a simple hypothesis, stated as an equality specifying an exact value of the parameter.
Example
1. $H_0: \mu = \mu_0$
2. $H_0: \mu_1 - \mu_2 = \delta$

Composite Hypothesis
A composite hypothesis is stated in terms of several possible values, i.e., by an inequality. The alternative hypothesis is a composite hypothesis involving statements expressed as inequalities such as <, > or ≠.
Example
1. $H_1: \mu > \mu_0$
2. $H_1: \mu < \mu_0$
3. $H_1: \mu \neq \mu_0$

Example of Hypothesis Testing
Consider an example, to check whether a coin is fair and balanced. According to the null hypothesis, half of the flips would be heads and half would be tails. According to the alternative hypothesis, the number of heads and tails may be different.

$H_0: P = 0.5$
$H_1: P \neq 0.5$

Flipping the coin 50 times might result in 40 heads and 10 tails. Based on this result, the null hypothesis must be rejected, and it is concluded according to the evidence that the coin was probably not fair and balanced.
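The coin example can be checked numerically. The sketch below uses an exact binomial test through binom.test() from base R. The choice of this particular test and the significance level of 0.05 are assumptions, since the text does not name a specific test for this example.

```r
# Exact binomial test of H0: P = 0.5 against H1: P != 0.5 (two-sided).
heads <- 40
flips <- 50
alpha <- 0.05                      # conventional significance level (assumed)

result <- binom.test(heads, flips, p = 0.5, alternative = "two.sided")

print(result$p.value)              # probability of a result at least this extreme under H0
reject_null <- result$p.value < alpha
print(reject_null)                 # TRUE means H0 is rejected at level alpha
```

With 40 heads in 50 flips the p-value is far below 0.05, which agrees with the conclusion that the null hypothesis is rejected.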