DEV UNIT 1&2 Notes
4.5 Scatter Plots and Resistant Lines
    4.5.1 Bi-Variate Analysis using Scatter Plot
    4.5.2 Bi-Variate Analysis using Resistant Lines
4.6 Transformations
Review Questions with Answers
4.7 Two Marks Questions with Answers
5.1 Introducing a Third Variable
5.2 Causal Explanations
5.3 Three Variable Contingency Tables and Beyond
5.4 Longitudinal Data
5.5 Fundamentals of Time Series Analysis (TSA) and Characteristics of Time Series Data
5.6 Data Cleaning
    5.6.1 Data Cleaning Concept and Methods
    5.6.2 Data Cleaning using Time-based Indexing, Visualizing, Grouping and Resampling
Review Questions with Answers
5.7 Two Marks Questions with Answers
Solved Model Question Paper

UNIT I

Syllabus: EDA fundamentals - Understanding data science - Significance of EDA - Making sense of data - Comparing EDA with classical and Bayesian analysis - Software tools for EDA - Visual aids for EDA - Data transformation techniques - Merging database, reshaping and pivoting - Transformation techniques - Grouping datasets - Data aggregation - Pivot tables and cross-tabulations.

Contents
1.1 Data and EDA Fundamentals
1.2 Understanding Data Science
1.3 Significance of EDA
1.4 Types of Exploratory Data Analysis
1.5 Making Sense of Data
1.6 Comparing EDA with Classical and Bayesian Analysis
1.7 Software Tools for EDA
1.8 Visual Aids for EDA
1.9 Data Transformation Techniques
1.10 Merging Database (Using Pandas Library)
1.11 Reshaping and Pivoting
1.12 Two Marks Questions with Answers

1.1 Data and EDA Fundamentals

1.1.1 Data and Information

* Data (singular: datum) is a collection of different facts, figures, objects, symbols and events that are capable of providing informative pieces, usually formatted in a particular manner. Data is a collection of facts, such as numbers, words, measurements, observations or just descriptions of things.
* Data can be qualitative or quantitative. Qualitative data is descriptive information (it describes something). Quantitative data is numerical information (numbers).
* Quantitative data can be discrete or continuous. Discrete data can only take certain values (like whole numbers); continuous data can take any value within a range. Discrete data is counted; continuous data is measured.

Examples of data:

Qualitative:
* My friend's favorite holiday destination.
* The most common names in India.
* How people describe the smell of a new deodorant.

Quantitative:
* Height (continuous).
* Weight (continuous).
* Leaves on a tree (discrete).
* Customers in a shop (discrete).

Information

* Information is defined as classified or organized data that has some meaningful value for the user. Information is also the processed data used to make decisions and take action. Processed data must meet the following criteria for it to be of any significant use in decision-making:
  o Accuracy - The information must be accurate.
  o Completeness - The information must be complete.
  o Timeliness - The information must be available when it is needed.

1.1.2 Data Collection Methods

* Data collection is defined as a method of collecting and analyzing data for the purpose of validation and research using some techniques. Data collection is done to analyze a problem and learn about its outcome and future trends.
* When there is a need to arrive at a solution for a question, data collection methods help to make assumptions about the result in the future.
* Data collection methods can be classified as:
  o Primary: This is original, first-hand data collected by the data researchers. This first-hand data process is the initial information-gathering step, performed before anyone carries out any further or related research. Primary data results are highly accurate provided the researcher collects the information. However, there is a downside, as first-hand research is potentially time-consuming and expensive.
  o Secondary: Secondary data is second-hand data collected by other parties that has already undergone statistical analysis. This data is either information that the researcher has tasked other people to collect or information the researcher has looked up. Simply put, it is second-hand information. Although it is easier and cheaper to obtain than primary information, secondary information raises concerns regarding accuracy and authenticity. Quantitative data makes up a majority of secondary data.

Types of data collection:

Fig. 1.1.1 Data collection methods

* Below are widely used data collection methods:
1. Interviews
2. Questionnaires and surveys
3. Observations
4. Documents and records
5. Focus groups
6. Oral histories.

* Some of the above methods are quantitative, dealing with something that can be counted. Others are qualitative, meaning that they consider factors other than numerical values. In general, questionnaires, surveys, and documents and records are quantitative, while interviews, focus groups, observations and oral histories are qualitative. There can also be a crossover between the two methods.

Primary data collection

o Interviews: The researcher asks questions of a large sampling of people, either by direct interviews or by means of mass communication such as phone or mail. This method is by far the most common means of data gathering.
o Projective technique: Projective data gathering is an indirect interview, used when potential respondents know why they are being asked questions and hesitate to answer. For instance, someone may be reluctant to answer questions about their phone service if a cell phone carrier representative poses the questions. With projective data gathering, the interviewees get an incomplete question and must fill in the rest, using their opinions, feelings and attitudes.
o Delphi technique: The Oracle at Delphi, according to Greek mythology, was the high priestess of Apollo's temple, who gave advice, prophecies and counsel. In the realm of data collection, researchers use the Delphi technique by gathering information from a panel of experts. Each expert answers questions in their field of specialty and the replies are consolidated into a single opinion.
o Focus groups: Focus groups, like interviews, are a commonly used technique. The group consists of anywhere from a half-dozen to a dozen people, led by a moderator, brought together to discuss the issue.
o Questionnaires: Questionnaires are a simple, straightforward data collection method. Respondents get a series of questions, either open- or close-ended, related to the matter at hand.

Secondary data collection

* Unlike primary data collection, there are no specific collection methods.
Instead, since the information has already been collected, the researcher consults various data sources, such as:
  o Financial statements
  o Sales reports
  o Retailer / distributor / dealer feedback
  o Customer personal information (e.g., name, address, age, contact info)
  o Business journals
  o Government records (e.g., census, tax records, Social Security info)
  o Trade / business magazines
  o The Internet.

Data collection tools:

* Below are widely used tools for data collection.
o Word association: The researcher gives the respondent a set of words and asks them what comes to mind when they hear each word.
o Sentence completion: Researchers use sentence completion to understand what kind of ideas the respondent has. This tool involves giving an incomplete sentence and seeing how the interviewee finishes it.
o Role-playing: Respondents are presented with an imaginary situation and asked how they would act or react if it were real.
o In-person surveys: The researcher asks questions in person.
o Online / web surveys: These surveys are easy to accomplish, but some users may be unwilling to answer truthfully, if at all.
o Mobile surveys: These surveys take advantage of the increasing proliferation of mobile technology. Mobile collection surveys rely on mobile devices like tablets or smartphones to conduct surveys via SMS or mobile apps.
o Phone surveys: No researcher can call thousands of people at once, so they need a third party to handle the chore. However, many people have call screening and won't answer.
o Observation: Sometimes, the simplest method is the best. Researchers who make direct observations collect data quickly and easily, with little intrusion or third-party bias. Naturally, it is only effective in small-scale situations.

1.1.3 Common Issues / Problems in Data

Inconsistent data
* When working with various data sources, it is conceivable that the same information will have discrepancies between sources. The differences could be in formats, units or occasionally spellings. The introduction of inconsistent data might also occur during firm mergers or relocations. Inconsistencies in data have a tendency to accumulate and reduce the value of data if they are not continually resolved. Organizations that have heavily focused on data consistency do so because they only want reliable data to support their analytics.

Data downtime
* Data is the driving force behind the decisions and operations of data-driven businesses. However, there may be brief periods when their data is unreliable or not prepared. Customer complaints and subpar analytical outcomes are only two ways that this data unavailability can have a significant impact on businesses. A data engineer spends about 80% of their time updating, maintaining and guaranteeing the integrity of the data pipeline. In order to ask the next business question, there is a high marginal cost due to the lengthy operational lead time from data capture to insight.
* Schema modifications and migration problems are just two examples of the causes of data downtime. Data pipelines can be difficult due to their size and complexity. Data downtime must be continuously monitored and reduced through automation.
Ambiguous data
* Even with thorough oversight, some errors can still occur in massive databases or data lakes. For data streaming at a fast speed, the issue becomes more overwhelming. Spelling mistakes can go unnoticed, formatting difficulties can occur and column heads might be deceptive. This unclear data might cause a number of problems for reporting and analytics.

Duplicate data
* Streaming data, local databases and cloud data lakes are just a few of the sources of data that modern enterprises must contend with. They might also have application and system silos. These sources are likely to duplicate and overlap each other quite a bit. For instance, duplicate contact information has a substantial impact on customer experience: if certain prospects are ignored while others are engaged repeatedly, marketing campaigns suffer. The likelihood of biased analytical outcomes increases when duplicate data are present. It can also result in ML models with biased training data.

Too much data
* While data-driven analytics and its advantages are prominent, a data quality problem with excessive data exists. There is a risk of getting lost in an abundance of data while searching for information pertinent to analytical efforts. Data scientists, data analysts and business users devote 80% of their work to finding and organizing the appropriate data. With an increase in data volume, other problems with data quality become more serious, particularly when dealing with streaming data and big files or databases.

Inaccurate data
* For highly regulated businesses like healthcare, data accuracy is crucial. Given recent experience, it is more important than ever to increase the data quality for COVID-19 and later pandemics. Inaccurate information does not provide a true picture of the situation and cannot be used to plan the best course of action. Personalized customer experiences and marketing strategies underperform if the customer data is inaccurate.
* Data inaccuracies can be attributed to a number of things, including data degradation, human error and data drift. Worldwide, data decay occurs at a rate of about 3% per month, which is quite concerning. Data integrity can be compromised while being transferred between different systems, and data quality might deteriorate with time.

Hidden data
* The majority of businesses only utilize a portion of their data, with the remainder sometimes being lost in data silos or discarded in data graveyards. For instance, the customer service team might not receive client data from sales, missing an opportunity to build more precise and comprehensive customer profiles. Hidden data causes missed possibilities to develop novel products, enhance services and streamline procedures.

Finding relevant data
* Finding relevant data is not so easy. There are several factors that need to be considered while trying to find relevant data, including:
  o Relevant domain
  o Relevant demographics
  o Relevant time period
and many more. Data that is not relevant to the study in any of these factors renders it obsolete, and one cannot effectively proceed with its analysis. This could lead to incomplete research or analysis, collecting data again and again, or shutting down the study.
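Several of the issues above - missing entries, duplicate rows and inconsistent spellings - can be detected programmatically before any analysis begins. Below is a minimal sketch using pandas; the file name customers.csv and the column name city are hypothetical.

    import pandas as pd

    # Hypothetical dataset; substitute your own file and column names.
    df = pd.read_csv("customers.csv")

    # Missing values: count of empty cells in each column.
    print(df.isnull().sum())

    # Duplicate data: number of rows that exactly repeat an earlier row.
    print(df.duplicated().sum())

    # Inconsistent data: variant spellings or casings of the same category
    # show up as separate entries in this frequency table.
    print(df["city"].str.strip().str.lower().value_counts())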
Deciding the data to collect
* Determining what data to collect is one of the most important questions in data collection and should be one of the first to be settled. One must choose the subjects the data will cover, the sources used to gather it and the required quantity of information. The answers to these questions will depend on the aims, or on what is expected to be achieved utilizing the data. As an illustration, one may choose to gather information on the categories of articles that website visitors between the ages of 20 and 50 most frequently access. One can also decide to compile data on the typical age of all the clients who made a purchase from the business over the previous month.

Dealing with big data
* Big data refers to exceedingly massive data sets with larger, more varied and more complex structures. These traits typically result in increased challenges with storing and analyzing data, and with applying additional methods of extracting results. Big data refers especially to data sets that are so enormous or intricate that conventional data processing tools are insufficient: an overwhelming amount of data, both unstructured and structured, that a business faces on a daily basis.
* The amount of data produced by healthcare applications, the internet, social networking sites, sensor networks and many other businesses is rapidly growing as a result of recent technological advancements. Big data refers to the vast volume of data created from numerous sources in a variety of formats at extremely fast rates. Dealing with this kind of data is one of the many challenges of data collection and is a crucial step toward collecting effective data.

1.1.4 Exploratory Data Analysis (EDA)

* Exploratory Data Analysis (EDA), a term coined by John W. Tukey, is a process of examining or understanding the data and extracting insights or main characteristics of the data. EDA is generally classified into two methods: graphical analysis and non-graphical analysis.
* EDA is a critical first step in analyzing the data from an experiment. Any method of looking at data that does not include formal statistical modeling and inference falls under the term exploratory data analysis.
* It is also known as visual analytics or descriptive statistics. It is the practice of inspecting and exploring data before stating hypotheses, fitting predictors and pursuing other inferential goals. It typically includes the computation of simple summary statistics which capture some property of interest in the data, and visualization. EDA can be thought of as an assumption-free, purely algorithmic practice.
* Exploratory Data Analysis (EDA) is the crucial process of using summary statistics and graphical representations to perform preliminary investigations on data in order to uncover patterns, detect anomalies, test hypotheses and verify assumptions.
* EDA is an approach to data analysis that postpones the usual assumptions about what kind of model the data follow, taking the more direct approach of allowing the data itself to reveal its underlying structure and model. EDA is not a mere collection of techniques; EDA is a philosophy about how to dissect a data set: what to look for, how to look and how to interpret.
* It is true that EDA heavily uses the collection of techniques called "statistical graphics", but it is not identical to statistical graphics.
* The basic purpose of EDA is to spot problems in data (as part of data wrangling) and understand variable properties like:
  o Central tendency (mean)
  o Spread (variance)
  o Skew
  o Outliers and anomalies.

Below are the most prominent reasons to use EDA:
1. Detection of mistakes.
2. Examining the data distribution.
3. Handling missing values of the dataset (a most common issue with every dataset).
4. Handling the outliers.
5. Removing duplicate data.
6. Encoding the categorical variables.
7. Normalizing and scaling.
8. Checking of assumptions.
9. Preliminary selection of appropriate models.
10. Determining relationships among the explanatory variables.
11. Assessing the direction and rough size of relationships between explanatory and outcome variables.

1.2 Understanding Data Science

* Data science is the domain of study that deals with vast volumes of data using modern tools and techniques to find unseen patterns, derive meaningful information and make business decisions. Data science combines math and statistics, specialized programming, advanced analytics, Artificial Intelligence (AI) and machine learning with specific subject matter expertise to uncover actionable insights hidden in an organization's data. These insights can be used to guide decision making and strategic planning.
* Data science is the field of study that combines domain expertise, programming skills and knowledge of mathematics and statistics to extract meaningful insights from data. Data science practitioners apply machine learning algorithms to numbers, text, images, video, audio and more to produce Artificial Intelligence (AI) systems that perform tasks which ordinarily require human intelligence. In turn, these systems generate insights which analysts and business users can translate into tangible business value. The data used for analysis can come from many different sources and be presented in various formats.

The data science lifecycle
* The data science lifecycle consists of five distinct stages, each with its own tasks:
1. Capture - Data acquisition, data entry, signal reception, data extraction. This stage involves gathering raw structured and unstructured data.
2. Maintain - Data warehousing, data cleansing, data staging, data processing, data architecture. This stage covers taking the raw data and putting it in a form that can be used.
3. Process - Data mining, clustering/classification, data modeling, data summarization. Data scientists take the prepared data and examine its patterns, ranges and biases to determine how useful it will be in predictive analysis.
4. Analyze - Exploratory/confirmatory analysis, predictive analysis, regression, text mining, qualitative analysis. This stage involves performing the various analyses on the data.
5. Communicate - Data reporting, data visualization, business intelligence, decision making. In this final step, analysts prepare the analyses in easily readable forms such as charts, graphs and reports.

Data science tools
* Below are various data science tools that can be used in the various stages of the data science process:
  o Data analysis - SAS, Jupyter, R Studio, MATLAB, Excel, RapidMiner
  o Data warehousing - Informatica / Talend, AWS Redshift
  o Data visualization - Jupyter, Tableau, Cognos, RAW
  o Machine learning - Spark MLlib, Mahout, Azure ML Studio.

Use of data science
1. Data science may detect patterns in seemingly unstructured or unconnected data, allowing conclusions and predictions to be made.
2. Tech businesses that acquire user data can utilize strategies to transform that data into valuable or profitable information.
3. Data science has also made inroads into the transportation industry, such as with driverless cars, where it helps to lower the number of accidents. For example, with driverless cars, training data is supplied to the algorithm and the data is examined using data science approaches, covering factors such as the speed limit on the highway, busy streets, etc.
4. Data science applications provide a better level of therapeutic customization through genetics and genomics research.
5. Data science has found applications in almost every industry.

Applications of data science
1. Healthcare: Healthcare companies are using data science to build sophisticated medical instruments to detect and cure diseases. Machine learning models and other data science components are used by hospitals and other healthcare providers to automate X-ray analysis and assist doctors in diagnosing illnesses and planning treatments based on previous patient outcomes.
2. Gaming: Video and computer games are now being created with the help of data science, which has taken the gaming experience to the next level.
3. Image recognition: Identifying patterns in images and detecting objects in an image is one of the most popular data science applications.
4. Recommendation systems: Netflix and Amazon give movie and product recommendations based on what people like to watch, purchase or browse on their platforms.
5. Logistics: Data science is used by logistics companies to optimize routes to ensure faster delivery of products and increase operational efficiency.
6. Fraud detection: Banking and financial institutions use data science and related algorithms to detect fraudulent transactions.
7. Internet search: When one thinks of search, Google immediately comes to mind. However, there are other search engines, such as Yahoo, DuckDuckGo, Bing and Ask, that employ data science algorithms to offer the best results for the searched query in a matter of seconds. Given that Google handles more than 20 petabytes of data per day, Google would not be the 'Google' widely popular today if data science did not exist.
8. Speech recognition: Speech recognition is dominated by data science techniques, and there are excellent applications of these algorithms in daily life. Virtual speech assistants like Google Assistant, Alexa or Siri are a few speech recognition applications to name; their voice recognition technology operates behind the scenes, attempting to interpret and evaluate spoken words and delivering useful results. Image recognition may also be seen on social media platforms such as Facebook, Instagram and Twitter: when one submits a picture of oneself with someone on one's friend list, these applications will recognise and tag them.
9. Targeted advertising: From displaying banners on various websites to digital billboards at airports, data science algorithms are utilised to identify almost anything.
This is why digital advertisements have a far higher CTR (Click-Through Rate) than traditional marketing. They can be customized based on a user's prior behaviour. That is why one user may see adverts for data science training programs while another person sees an advertisement for clothes in the same region at the same time.
10. Airline route planning: As a result of data science, it is easier to predict flight delays for the airline industry, which is helping it grow. It also helps to determine whether to fly directly to the destination or to make a stop in between; for example, a flight from Delhi to the United States of America can go direct, or stop in between and then arrive at the destination.
11. Virtual reality: A virtual reality headset incorporates computer expertise, algorithms and data to create the best viewing experience possible. The popular game Pokemon GO is a minor step in that direction, with the ability to wander around and look at Pokemon on walls, streets and other non-existent surfaces. The makers of this game chose the locations of the Pokemon and gyms using data from Ingress, the previous app from the same business.
12. To evaluate clients in retail shopping: Retailers evaluate client behaviour and shopping trends in order to provide individualized product suggestions as well as advertising, marketing and promotions. Data science also assists them in managing product inventories and supply chains in order to keep items in stock.
13. Entertainment: Data science enables streaming services to follow and evaluate what consumers view, which aids in the creation of new TV series and films. Data-driven algorithms are also utilised to provide tailored suggestions based on a user's watching history.
14. Finance: Banks and credit card firms mine and analyse data in order to detect fraudulent activities, manage financial risks on loans and credit lines, and assess client portfolios in order to uncover upselling possibilities.
15. Manufacturing: Data science applications in manufacturing include supply chain management and distribution optimization, as well as predictive maintenance to anticipate probable equipment faults in facilities before they occur.

1.3 Significance of EDA

* Exploratory Data Analysis (EDA) involves using statistics and visualizations to analyze and identify trends in data sets. The primary intent of EDA is to determine whether a predictive model is a feasible analytical tool for a business challenge or not.
* EDA helps data scientists gain an understanding of the data set beyond the formal modeling or hypothesis testing task. Exploratory data analysis is essential for any research analysis, in order to gain insights into a data set. The importance of using EDA for analyzing data sets is:
  o EDA helps identify errors in data sets.
  o EDA gives a better understanding of the data set.
  o EDA helps detect outliers or anomalous events.
  o EDA helps to understand data set variables and the relationships among them.
* Different fields of science, economics, engineering and marketing accumulate and store data primarily in electronic databases. Appropriate and well-established decisions should be made using the data collected.
* It is practically impossible to make sense of datasets containing more than a handful of data points without the help of computer programs.
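As a concrete illustration of such a computer-assisted first look, a few pandas calls produce the summary statistics and frequency tables that EDA starts from. A minimal sketch; the file name sales.csv and the column name region are hypothetical.

    import pandas as pd

    df = pd.read_csv("sales.csv")  # hypothetical dataset

    df.info()             # column types and non-null counts
    print(df.describe())  # count, mean, std, min, quartiles and max per numeric column

    # Frequency table of one categorical column.
    print(df["region"].value_counts())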
* To be certain of the insights that the collected data provides, and to make further decisions, data mining is performed, with its distinctive analysis processes.
* Exploratory data analysis is key, and usually the first exercise, in data mining. It allows us to visualize data to understand it, as well as to create hypotheses for further analysis. The exploratory analysis centers around creating a synopsis of the data, or insights for the next steps in a data mining project.
* EDA reveals ground truth about the content without making any assumptions; data scientists use this process to understand what type of modeling and hypotheses can be created. Key components of exploratory data analysis include summarizing data, statistical analysis and visualization of data. Python provides expert tools for exploratory analysis: pandas for summarizing, along with other libraries for statistical analysis, and matplotlib and plotly for visualization.
* Data scientists can use exploratory analysis to ensure the results they produce are valid and applicable to any desired business outcomes and goals. EDA also helps stakeholders by confirming they are asking the right questions. EDA can help answer questions about standard deviations, categorical variables and confidence intervals. Once EDA is complete and insights are drawn, its findings can then be used for more sophisticated data modeling, including machine learning.

Importance of EDA in Data Processing and Modeling

* Outliers or abnormal occurrences in a dataset can have an impact on machine learning models. The dataset might also contain some missing or duplicate values.

Data Cleaning and Preprocessing
* Data preprocessing and cleansing are critical components of EDA. Understanding the variables and the structure of the dataset is the initial stage in this process; the data can then be cleaned. The dataset may contain redundancy, such as irregular data or outliers, that may cause the model to overfit.
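One common way to flag such outliers during cleaning is the 1.5 x IQR (interquartile range) rule. A minimal sketch; the sample values are made up for illustration.

    import pandas as pd

    def iqr_outliers(s: pd.Series) -> pd.Series:
        """Return the values of s lying outside the 1.5 * IQR fences."""
        q1, q3 = s.quantile(0.25), s.quantile(0.75)
        iqr = q3 - q1
        lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
        return s[(s < lower) | (s > upper)]

    s = pd.Series([4.1, 4.3, 3.9, 4.0, 4.2, 9.8, 0.2])
    print(iqr_outliers(s))  # flags 9.8 and 0.2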
1.6 Comparing EDA with Classical and Bayesian Analysis

* Classical data analysis follows the steps below:
  Problem -> Data -> Model -> Analysis -> Conclusions
* Exploratory data analysis follows the steps below:
  Problem -> Data -> Analysis -> Model -> Conclusions
* Bayesian data analysis follows the steps below:
  Problem -> Data -> Model -> Prior Distribution -> Analysis -> Conclusions
* That is, in classical analysis the data collection is followed by the imposition of a model (normality, linearity, etc.), and the analysis, estimation and testing that follow are focused on the parameters of that model. For EDA, the data collection is not followed by a model imposition; rather, it is followed immediately by analysis, with the goal of inferring what model would be appropriate.
* For a Bayesian analysis, the analyst attempts to incorporate scientific/engineering knowledge and expertise into the analysis by imposing a data-independent distribution on the parameters of the selected model; the analysis thus consists of formally combining both the prior distribution on the parameters and the collected data to jointly make inferences and/or test assumptions about the model parameters.
* In the real world, data analysts freely mix elements of all three of the above approaches, and other approaches as well if required.

Classical Analysis vs. EDA

The classical analysis approach and EDA can be compared on the following grounds.

1. Model
* The classical approach imposes models (both deterministic and probabilistic) on the data. Deterministic models include, for example, regression models and analysis of variance (ANOVA) models. The most common probabilistic model assumes that the errors about the deterministic model are normally distributed; this assumption affects the validity of the ANOVA F tests.
* The exploratory data analysis approach does not impose deterministic or probabilistic models on the data. On the contrary, the EDA approach allows the data to suggest admissible models that best fit the data.

2. Focus
* The two approaches differ drastically in focus. For classical analysis, the focus is on the model: estimating the parameters of the model and generating predicted values from the model.
* For exploratory data analysis, the focus is on the data: its structure, its outliers and the models suggested by the data.

3. Techniques
* Classical techniques are generally quantitative in nature. They include chi-squared tests and F tests. EDA techniques are generally graphical. They include scatter plots, character plots, box plots, histograms, bi-histograms, residual plots and mean plots.

4. Rigor
* Classical techniques serve as the probabilistic foundation of science and engineering; the most important characteristic of classical techniques is that they are rigorous, formal and "objective". EDA techniques do not share in that rigor or formality; they make up for that lack of rigor by being very suggestive and indicative of what the appropriate model should be. EDA techniques are subjective and depend on interpretation, which may differ from analyst to analyst, although experienced analysts commonly arrive at identical conclusions.

5. Data treatment
* Classical estimation techniques have the characteristic of taking all of the data and mapping the data into a few numbers ("estimates"). This is both a virtue and a vice. The virtue is that these few numbers focus on important characteristics (location, variation, etc.) of the population. The vice is that concentrating on these few characteristics can filter out other characteristics (skewness, tail length, autocorrelation, etc.) of the same population; in this sense there is a loss of information due to this "filtering" process.
* The EDA approach, in contrast, often makes use of (and shows) all of the available data. In this sense there is no corresponding loss of information.

6. Assumptions
* Tests based on classical techniques are usually very sensitive; that is, if a true shift in location, say, has occurred, such tests frequently have the power to detect such a shift and to conclude that the shift is "statistically significant". But classical tests depend on underlying assumptions (e.g., normality), and hence the validity of the test conclusions becomes dependent on the validity of the underlying assumptions. Further, the exact underlying assumptions may be unknown to the analyst or, if known, untested. Thus the validity of the scientific conclusions becomes intrinsically linked to the validity of the underlying assumptions; in practice, if such assumptions are unknown or untested, the validity of the scientific conclusions becomes suspect.
* Many EDA techniques make little or no assumptions; they present and show the data - all of the data - as is, with fewer encumbering assumptions.

1.7 Software Tools for EDA

Python, R and Excel are some of the popular EDA tools.

1. R - An open-source programming language and free software environment for statistical computing and graphics, supported by the R Foundation for Statistical Computing. The R language is widely used among statisticians in developing statistical observations and data analysis.

2. Python - An interpreted, object-oriented programming language with dynamic semantics. Its high-level, built-in data structures, combined with dynamic typing and dynamic binding, make it very attractive for rapid application development, as well as for use as a scripting or glue language to connect existing components together. Python and EDA can be used together to identify missing values in a data set, which is important so one can decide how to handle missing values for machine learning. Other functions that can be performed are the description of data, handling outliers and getting insights through plots. Python's high-level, built-in data structures and dynamic typing and binding make it an attractive tool for EDA. Analyzing a dataset is a hectic task that takes a lot of time, and Python provides certain open-source modules that can automate the whole process of EDA and help in saving time.

3. Excel / Spreadsheet - A very long-running and dependable tool, Excel remains an indispensable part of the analytics industry. Most of the problems faced in analytics projects are solved using this software. It supports all the important features, like summarizing data, visualizing data and data wrangling, which are powerful enough to inspect data from all possible angles. Microsoft Excel is paid, but there are various other spreadsheet tools, like OpenOffice and Google Docs, which are certainly worth using.

4. Weka - Weka is an easy-to-learn machine learning tool with an intuitive interface to get the job done quickly. It provides options for data pre-processing, classification, regression, clustering, association rules and visualization.
Most of the steps of model building can be achieved using Weka. It is built on Java.

5. Tableau Public - Tableau is data visualization software. It is a fast visualization tool which allows exploring data, every observation, using various possible charts. Its intelligent algorithms figure out by themselves the type of data and the best method available. For understanding data in real time, Tableau can get the job done. In a way, Tableau imparts a colorful life to data and allows sharing work with others.

1.8 Visual Aids for EDA

1. Univariate Plots (used for univariate data - data containing one variable)

* Univariate plots show the frequency or the distribution shape of a variable. Below are visual tools used to analyze univariate data.

i. Histograms
* Histograms are two-dimensional plots in which the x-axis divides into a range of numerical bins or time intervals. The y-axis shows the frequency values, which are counts of occurrences of values for each bin. Bar graphs have gaps between the bars to indicate that they compare distinct groups, but there are no gaps in histograms. Histograms convey whether the distribution is right/positively skewed (the tail extends to the right, with most of the data falling on the left), left/negatively skewed (the tail extends to the left, with most of the data falling on the right), bi-modal (having two distinct peaks), normal (perfectly symmetrical, without skew) or uniform (almost all the bins have similar frequency).
* Example: Below are the waiting times (in minutes) of customers at the cash counter of a grocery shop during peak hours, as observed by the cashier:

  2.30, 5.00, 3.55, 2.50, 5.10, 4.21, 3.33, 4.10, 2.55, 5.07, 3.45, 4.10, 5.12

* For this data a histogram can be created using five bins, each with its own frequency, as seen in Fig. 1.8.1. The y-axis shows the number of customers falling in each category, and the x-axis shows the ranges of waiting times. For example, the first bin ranges from 2.30 to 2.86 minutes, and its count is three, which can be verified from the data above (2.30, 2.50 and 2.55).

Fig. 1.8.1 Histogram (bins: 2.30-2.86, 2.86-3.43, 3.43-3.99, 3.99-4.56 and 4.56-5.12 minutes)

* The result is a random distribution: a type of distribution that has several peaks and lacks an apparent pattern.
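A histogram like Fig. 1.8.1 can be reproduced in a few lines of matplotlib, using the thirteen waiting-time values listed above and five equal-width bins.

    import matplotlib.pyplot as plt

    # Waiting times (minutes) from the example above.
    waits = [2.30, 5.00, 3.55, 2.50, 5.10, 4.21, 3.33,
             4.10, 2.55, 5.07, 3.45, 4.10, 5.12]

    # Five equal-width bins spanning 2.30-5.12, as in Fig. 1.8.1.
    plt.hist(waits, bins=5, edgecolor="black")
    plt.xlabel("Waiting time (minutes)")
    plt.ylabel("Number of customers")
    plt.show()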
ii. Distplot
* A distplot is also known as a second histogram, because it is a slight improvement over the histogram. A distplot overlays a KDE (Kernel Density Estimation) curve on the histogram; the KDE represents the PDF (Probability Density Function), that is, the probability of each value occurring in the column. Fig. 1.8.2 shows a distplot of release years, indicating the years in which a news channel was most active with new additions.

Fig. 1.8.2 Distplot (density of new additions by release year)

iii. Line chart
* A line chart connects a series of data points with a continuous line. This is the simplest chart used in finance, and it typically depicts only a security's closing prices over time. Line charts can be used for any timeframe, but they most often use day-to-day price changes.
* There are different types of line charts:
  o Line chart - It shows the trend over time (years, months, days) or other categories. It is used when the order of time or categories is important.
  o Line chart with markers - It is similar to the line chart, but it highlights data points with markers.
  o Stacked line chart - This is a line chart where the lines of the data points do not overlap, because they are cumulative at each point.
  o Stacked line chart with markers - This is similar to the stacked line chart, but it highlights data points with markers.
  o 100% stacked line chart - It shows the percentage contribution to a whole over time or across categories.
  o 100% stacked line chart with markers - This is similar to the 100% stacked line chart, but markers highlight the data points.
* The line chart in Fig. 1.8.3 shows the number of houses sold in particular months.

Fig. 1.8.3 Line plot (number of houses sold, January to June)

iv. Stacked area plot/chart
* An area chart combines the line chart and bar chart to show how one or more groups' numeric values change over the progression of a second variable, typically time. An area chart is distinguished from a line chart by the addition of shading between the lines and a baseline, as in a bar chart.
* A stacked area chart is plotted in the form of several area series stacked on one another. The height of each series is determined by the value in each data point. A typical use case for stacked area charts is analyzing how each of several variables and their totals vary on the same graphic.

Fig. 1.8.4 Stacked area chart

v. Table chart
* A table chart is a means of arranging data in rows and columns. The use of tables is pervasive throughout all communication, research and data analysis. Tables appear in print media, handwritten notes, computer software, architectural ornamentation, traffic signs and many other places.

vi. Probability distribution plots
* Probability distributions are mathematical functions that describe all the possible values that a random variable can assume within a given range. They help model random phenomena, allowing us to estimate the probability of a particular event. This type of distribution is helpful for knowing the likely outcomes and the spread of potential values.
* For a single random variable, probability distributions can be divided into two types:
  o Discrete probability distributions for discrete variables: Also known as probability mass functions, these apply when the random variable can assume only a discrete number of values, like the number of reviews; it can be 100 or 101, but nothing in between. The function returns probabilities, hence the output is between 0 and 1. There are a variety of discrete probability distributions that can be used to model different types of data.
    - Binomial distribution: There are two possible outcomes in this distribution - success or failure - and multiple trials are carried out. The probability of success and failure is the same for all trials, and the sum of all probabilities must equal one. For example, suppose there is a success probability of 0.8 of manufacturing a perfect car engine part. What is the probability of having seven successes in 10 attempts? The probability of success is 0.8 and of failure is 0.2; the number of trials is ten and the number of successes is 7.
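The binomial question above can be answered directly with scipy.stats. The same module also handles the continuous case discussed in the next subsection; the 6-to-6.5 interval below is an assumption chosen to match the roughly 15% figure quoted there.

    from scipy import stats

    # P(exactly 7 successes in 10 trials, success probability 0.8)
    print(stats.binom.pmf(k=7, n=10, p=0.8))  # ~0.2013

    # Continuous case: for a normal distribution with mean 5.5 and standard
    # deviation 1, the probability of landing in an interval is the difference
    # of the CDF at the interval's endpoints (here assumed to be 6 and 6.5).
    p = stats.norm.cdf(6.5, loc=5.5, scale=1) - stats.norm.cdf(6.0, loc=5.5, scale=1)
    print(p)  # ~0.15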
  o Probability density functions for continuous variables: The random variable can assume an infinite number of values between any two values; a continuous variable can take any value, like 45.3, 45.36, 45.369 or 45.3698, and so on. Probabilities for continuous distributions are measured over ranges of values rather than single points. A probability indicates the likelihood that a value will fall within an interval, and the entire area under the distribution curve equals 1. The proportion of the area under the curve that falls within a range of values along the x-axis is the likelihood that a value will fall within that range.
* Example: Consider the distribution of adults' heights in a town, where the mean equals 5.5 and the standard deviation is 1. The shaded area under the curve between 6 and 6.5 shows the probability that a randomly picked person's height falls in that range; it is approximately equal to 0.15, or 15%.

vii. Run sequence plots
* A run chart, also known as a run-sequence plot, displays observed data in a time sequence. Often, the data displayed represents some aspect of a business process output or performance. It is, therefore, a form of line chart. Run charts are often used to locate anomalies in data that suggest shifts in a process over time; shifts in location and scale, as well as outliers, can easily be detected.

2. Bivariate Plots (used for bivariate data - data containing two variables)
* Bivariate plots display the relationship between two variables in exploratory data analysis.

i. Bar graphs
* Bar graphs compare values across discrete categories and are suitable for recognizing trends.

ii. Scatter plots
* Scatter plots are commonly used in statistical analysis to visualize numerical relationships. They are used to determine whether two measures are correlated by plotting them on the x- and y-axes. They are suitable for revealing relationships between variables.

iii. Box plots
* These charts show the distribution of values along an axis, using boxes to bucket the data and giving an idea of how the data is distributed. These boxes are also called quartiles, each representing a quarter of the data. A box plot can be drawn vertically or horizontally, and one can easily identify outliers by their very high or low values.
* Box plots are suitable for identifying outliers.

iv. Correlation plots (heat maps)
* Correlation heat maps show the interrelationship between variables, with areas shaded as per the data's values. Color differences make it easy to spot similar and different values and to make sense of the variation in the data. They are usually helpful when there is a large amount of data. Heat maps are used during A/B testing to see which parts of a web page are accessed by users, to track the number of reviews generated every hour, or to analyze a cricket match to understand where a batsman scores the bulk of his runs or where the bowler pitches the ball.

v. Cluster map
* One can also use a cluster map to understand the relationship between two categorical variables. A cluster map basically plots a dendrogram that shows the categories of similar behavior together.

Special purpose plots

i. Pair plots
* Pair plots are a simple way to visualize relationships between multiple variables. A pair plot produces a matrix of relationships between the variables in the data for a direct examination of the data.
* For example, a pair plot of bike rental data shows how registered and casual users are using bike rentals, and also shows the effect of temperature, humidity and wind speed on rentals. This gives an overview of the correlation between multiple variables.
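A pair plot like the one just described is a single call in seaborn. A minimal sketch, in which the rentals DataFrame and its values are made up to stand in for real bike-rental data.

    import pandas as pd
    import seaborn as sns
    import matplotlib.pyplot as plt

    # Hypothetical bike-rental data; in practice this would come from a file.
    rentals = pd.DataFrame({
        "temperature": [9.8, 9.0, 1.2, 1.4, 2.7, 8.3],
        "humidity":    [81, 70, 44, 59, 44, 52],
        "windspeed":   [0.0, 16.6, 16.6, 10.7, 12.5, 6.0],
        "casual":      [331, 131, 120, 108, 82, 88],
        "registered":  [654, 670, 1229, 1454, 1518, 1510],
    })

    # One scatter plot for every pair of variables, histograms on the diagonal.
    sns.pairplot(rentals)
    plt.show()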
ii. Contour plots
* A contour plot can be used for representing a 3D surface in a 2D format. Contour plots are generally used for continuous variables rather than categorical data. Contour maps were inspired by seismic data analysis. They can show where the data density is high, and they can be used to explore deep learning error functions or for gradient analysis.

iii. Density plots
* A density plot is a smoothed, continuous version of a histogram estimated from the data. The most common form of estimation is the kernel density estimate. In this method, a continuous curve (the kernel) is drawn at every individual data point; all of these curves are then combined to make a single smooth density estimation.
* The y-axis in a density plot is the probability density function of the kernel density estimation. The difference is that it is a probability density, and not a probability: the density is the probability per unit on the x-axis.
* While comparing the distributions of one variable across multiple categories, histograms have issues with readability. Density plots are useful in this scenario.

iv. Polar / Spider / Radar charts
* Polar charts are circular charts that use values and angles to show information as polar coordinates. Polar charts are useful for showing scientific data.
* A polar chart is a graphical way of displaying multivariate data, in the form of quantitative variables represented on axes starting from the same point. It can demonstrate a dominant variable.
* A polar chart is a common variation of circular graphs. It is useful when relationships between data points can be visualized most easily in terms of radians and angles.
* In polar charts, a series is represented by a closed curve that connects points in the polar coordinate system. Each data point is determined by the distance from the pole (the radial coordinate) and the angle from the fixed direction (the angular coordinate).

v. Lollipop chart
* The lollipop chart is a composite chart with bars and circles: a variant of the bar chart with a circle at the end to highlight the data value. Like a bar chart, it is used to compare categorical data; this kind of composite chart simply uses more visual elements to convey the information.
* A Lollipop Chart (LC) is a handy variation of a bar chart where the bar is replaced by a line and a dot at the end. Just like bar graphs, lollipop plots are used to make comparisons between different items or categories. They are also used for showing trends over time. Only one numerical variable is compared per category. They are not suitable for showing relationships, distribution or composition.
* Lollipop charts are two-dimensional with two axes: one axis shows categories or data series, the other axis shows numerical values.
* LCs are preferred to bar charts when one is comparing many values of similar magnitude. In that case, with a standard bar chart, the reader may experience an optical effect called a Moiré pattern (the misleading perception that occurs when viewing a set of lines or bars superimposed on another set of lines or bars, where the sets differ in relative size).

vi. Lag plots
* A relationship between an observation and the previous observation is beneficial in time series modeling. Previous observations in a time series are called lags: the observation at one previous time step is known as lag 1, the observation at two previous time steps as lag 2, and so on. A lag plot is a useful type of plot for exploring the relationship between each observation and a lag of that observation, displayed as a scatter plot.
* If the points cluster along a diagonal line from the bottom-left to the top-right of the plot, it suggests a positive correlation relationship. If the points cluster along a diagonal line from the top-left to the bottom-right, it suggests a negative correlation relationship.
* Lag plots can help compare observations against the same point in the last week, last month or the previous year by using the corresponding lag values.
* A lag plot of the count of bike rentals against the previous day's count, for example, displays a relatively strong positive correlation.

vii. Auto-correlation plots
* The correlation between observations and their lag values in a time series is named autocorrelation. Correlation coefficients are plotted on an autocorrelation plot.
* A correlation coefficient between observations and their lag-1 values results in a number between -1 and +1. A value close to zero suggests a weak correlation, whereas a value closer to -1 or +1 indicates a strong correlation. The plot helps better understand how this relationship changes over the lag; it shows the lag on the x-axis and the correlation on the y-axis.

viii. Lognormal plots
* A normal distribution can be converted to a lognormal distribution using logarithmic mathematics. The lognormal distribution plots the log of random variables from a normal distribution curve. It displays the Probability Density Function (PDF) and is of particular interest when the variable must be positive, as log values are always positive.
* Many examples follow a lognormal distribution: the concentration of elements and their radioactivity in the Earth's crust, latent periods of infectious diseases, the distribution of particles, chemicals and organisms in the environment, the length of comments posted on social media discussion forums, and fluctuations in the stock markets.
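pandas ships plotting helpers for both the lag plot and the autocorrelation plot described above. A minimal sketch on a synthetic series; the cumulative-sum construction is only there to guarantee visible autocorrelation.

    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt
    from pandas.plotting import lag_plot, autocorrelation_plot

    # Synthetic daily series: a random walk, so successive days correlate.
    rng = np.random.default_rng(0)
    series = pd.Series(np.cumsum(rng.normal(0, 1, 365)))

    lag_plot(series, lag=1)       # each observation vs. its lag-1 value
    plt.show()

    autocorrelation_plot(series)  # correlation coefficient at every lag
    plt.show()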
ix. Violin plot
* A violin plot is a hybrid of a box plot and a kernel density plot. It is used to visualize the distribution of numerical data. Unlike a box plot, which can only show summary statistics, a violin plot depicts the full distribution together with the summary statistics of each variable.

x. Joint plot
* While the pair plot provides a visual insight into all possible correlations, the joint plot provides bivariate plots with univariate marginal distributions.

xi. Pie chart
* The pie chart is similar to the countplot; it additionally gives the percentage presence of each category in the data, which shows how much weightage each category carries in the data.

3. Multivariate Visualization (used for multivariate data - data containing more than two variables)
* When dealing with multiple variables, it is tempting to make three-dimensional plots, but such data can also be viewed with scatter plots of the relation between each pair of variables.
* Combined charts are also handy ways to visualize data, since each chart is easy to understand on its own.
* Another option is to reduce the data with principal component analysis or other techniques and then make a plot of the reduced components. This is particularly important for high-dimensionality data and has applications in machine learning, such as visualizing natural language or images.

Text data
* For example, with text data one could create a word cloud, where the size of each word is based on its frequency in the text. To remove the words common to every document in the dataset, the documents can be grouped using topic modeling and the most relevant words can be displayed.

Image data
* When doing image classification, it is common to use decomposition to reduce the dimensionality of the data.
* Instead of blindly using decomposition, a data scientist could plot the components. By looking at the contrast (black and white) in the images, one can relate a component's importance to the locations of the eyes, nose and mouth, and to the overall face shape.

1.9 Data Transformation Techniques

* Data transformation techniques refer to all the actions that help to transform raw data into a clean and ready-to-use dataset.
* There are different types of data transformation techniques, each offering a unique way of transforming the data, and there is a chance that not all of these techniques will be needed on every project. Below are basic data transformation techniques that can be used in analysis projects or data pipelines.

1. Data smoothing
* Smoothing is a technique where an algorithm is applied to remove noise from the dataset when trying to identify a trend. Noise can have a bad effect on the data, and by eliminating or reducing it one can extract better insights or identify patterns that would not be seen otherwise.
* There are three algorithm types that help with data smoothing:
  o Clustering: One can group similar values together to form a cluster, while labeling any value outside the clusters as an outlier.
  o Binning: A binning algorithm splits the data into bins and smooths the data values within each bin.
  o Regression: Regression algorithms are used to identify the relation between two dependent attributes and help to predict one attribute based on the value of the other.

2. Attribution construction
* Attribution construction is one of the most common techniques in data transformation pipelines. Attribution construction, or feature construction, is the process of creating new features from the set of existing features/attributes in the dataset.
* Imagine working in marketing and trying to analyze the performance of a campaign. One may have all the impressions that the campaign generated and the total cost for the given time frame. Instead of trying to compare these two metrics across all of the campaigns, one can construct another metric to calculate the cost per mille impressions, or CPM.

Fig. 1.9.1 Attribution construction

* This will make the data exploration and analysis process a lot easier, as one can compare every single campaign's performance on a single metric.
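As a sketch of attribute construction, the CPM metric above is one line of pandas arithmetic; the campaigns DataFrame and its numbers are hypothetical.

    import pandas as pd

    campaigns = pd.DataFrame({
        "campaign":    ["A", "B", "C"],
        "cost":        [120.0, 80.0, 300.0],      # total spend
        "impressions": [40_000, 16_000, 150_000],
    })

    # New feature: cost per thousand (mille) impressions.
    campaigns["cpm"] = campaigns["cost"] / campaigns["impressions"] * 1000
    print(campaigns.sort_values("cpm"))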
3. Data generalization
• Data generalization refers to the process of transforming low-level attributes into high-level ones by using the concept of hierarchy. Data generalization applies to categorical data that has a finite but large number of distinct values.
• This is something everyone already does unconsciously, and it helps to get a clearer picture of the data. Suppose there are four categorical attributes in the dataset: 1. City, 2. Street, 3. Country, 4. State/province.
• One can define a hierarchy between these attributes by specifying the ordering among them at the schema level, for example: street < city < state/province < country.

4. Data aggregation
• Data aggregation is possibly one of the most popular techniques in data transformation. When data aggregation is applied to the raw data, it is essentially a process of gathering and presenting data in a summary format.
• This is ideal when one wants to perform statistical analysis over the data, as one may want to aggregate the data over a specific time period and provide statistics such as average, sum, minimum and maximum.
• For instance, raw data can be aggregated over a given time period to provide statistics such as average, minimum, maximum, sum and count. After the data is aggregated, it can be summarized across resources or resource groups. There are two types of data aggregation: time aggregation and spatial aggregation.

5. Data discretization
• Data discretization converts continuous data into a set of intervals. This is an especially useful technique that can help to make the data easier to study and analyze and improve the efficiency of any applied algorithm.
• Imagine having tens of thousands of rows representing people in a survey with their first name, last name, age and gender. Age is a numerical attribute that can take a lot of different values. To make it easier, the range of this continuous attribute can be divided into intervals.
• Mapping this attribute to a higher-level concept, like youth, middle-aged and senior, can help a lot with the efficiency of the task and improve the speed of the algorithms applied.
• There is a wide variety of discretization methods, starting with naive methods such as equal-width and equal-frequency and going up to much more sophisticated methods such as MDLP.

6. Data normalization
• Data normalization is the process of scaling the data to a much smaller range, without losing information, in order to help minimize or exclude duplicated data and improve algorithm efficiency and data extraction performance. There are three methods to normalize an attribute:
  ○ Min-max normalization: A linear transformation is performed on the original data.
  ○ Z-score normalization: In z-score normalization (or zero-mean normalization), the values of attribute A are normalized using the mean and standard deviation of A.
  ○ Decimal scaling: The value of attribute A is normalized by moving the decimal point in the value.
• Normalization methods are frequently used when there are values that skew the dataset and it is hard to extract valuable insights. A short discretization and normalization sketch follows.
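A minimal sketch of the discretization and normalization steps just described, assuming a hypothetical survey table. The age and income columns (and the bin edges) are illustrative choices, not from the original text.

import pandas as pd

survey = pd.DataFrame({"age": [16, 24, 38, 45, 61, 72],
                       "income": [1200, 2500, 4000, 5200, 3100, 2800]})

# Discretization: map the continuous age range onto higher-level concepts
survey["age_group"] = pd.cut(survey["age"],
                             bins=[0, 30, 60, 120],
                             labels=["youth", "middle-aged", "senior"])

# Min-max normalization: linear rescaling of income into [0, 1]
inc = survey["income"]
survey["income_minmax"] = (inc - inc.min()) / (inc.max() - inc.min())

# Z-score normalization: centre on the mean, scale by the standard deviation
survey["income_zscore"] = (inc - inc.mean()) / inc.std()

print(survey)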
7. Integration
• Data integration is a crucial step in data pre-processing that involves combining data residing in different sources and providing users with a unified view of these data. It includes multiple databases, data cubes or flat files, and works by merging the data from the various data sources. There are two major approaches to data integration: the tight coupling approach and the loose coupling approach.

8. Manipulation
• Data manipulation is the process of changing or altering data to make it more readable and organised. Data manipulation tools help identify patterns in the data and transform it into a usable form to generate insights on financial data, customer behaviour, etc.

1.10 Merging Database (Using Pandas Library)
• Pandas is a software library written for the Python programming language for data manipulation and analysis.
• The Series and DataFrame objects in pandas are powerful tools for exploring and analyzing data. Part of their power comes from a multifaceted approach to combining separate datasets. With pandas, one can merge, join and concatenate datasets, allowing one to unify and better understand the data being analyzed.
• The Pandas DataFrame is a structure that contains two-dimensional data and its corresponding labels: a two-dimensional, potentially heterogeneous tabular data structure with labeled axes (rows and columns). DataFrames are widely used in data science, machine learning, scientific computing and many other data-intensive fields.
• DataFrames are similar to SQL tables or the spreadsheets that one works with in Excel or Calc. In many cases, DataFrames are faster, easier to use and more powerful than tables or spreadsheets, because they are an integral part of the Python and NumPy ecosystems.
• A Pandas Series is like a column in a table. It is a one-dimensional array holding data of any type.

1.10.1 Pandas merge(): Combining Data on Common Columns or Indices
• merge() can be used when functionality similar to a database's join operations is required. It is the most flexible operation that can be applied to data.
• When one wants to combine data objects based on one or more keys, similar to what is done in a relational database, merge() is the tool to use. More specifically, merge() is most useful when one wants to combine rows that share data.
• One can achieve both many-to-one and many-to-many joins with merge(). In a many-to-one join, one of the datasets will have many rows in the merge column that repeat the same values. For example, the values could be 1, 1, 3, 5 and 5. At the same time, the merge column in the other dataset will not have repeated values; take 1, 3 and 5 as an example.
• In a many-to-many join, both of the merge columns have repeated values. These merges are more complex and result in the cartesian product of the joined rows. This means that, after the merge, there will be every combination of rows that share the same value in the key column.
• What makes merge() so flexible is the sheer number of options for defining the merge behaviour.
• merge() requires two arguments: 1. the left DataFrame and 2. the right DataFrame. After that, it accepts the optional arguments below to define how the datasets are merged:
  ○ how defines what kind of merge to make. It defaults to 'inner', but other possible options include 'outer', 'left' and 'right'.
  ○ on tells merge() which columns or indices, also called key columns or key indices, one wants to join on. This is optional. If it isn't specified, and left_index and right_index (covered below) are False, then columns from the two DataFrames that share names will be used as join keys. If on is used, then the specified column or index must be present in both objects.
  ○ left_on and right_on specify a column or index that is present only in the left or right object being merged. Both default to None.
  ○ left_index and right_index both default to False, but to use the index of the left or right object in the merge, one can set the relevant argument to True.
  ○ suffixes is a tuple of strings to append to identical column names that aren't merge keys. This allows one to keep track of the origins of columns with the same name.
• These are some of the most important parameters to pass to merge().

Using merge()
• Before getting into the details of how to use merge(), one should first understand the various forms of joins:
  ○ Inner
  ○ Outer
  ○ Left
  ○ Right

Fig. 1.10.1 Merge and joins (inner, outer, left and right joins shown as paired circles)
• In this image, the two circles are the two datasets, and the labels point to which part or parts of the datasets are expected to be seen in the result.
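A minimal sketch of the many-to-one case quoted above, using exactly the key values 1, 1, 3, 5, 5 on the left and 1, 3, 5 on the right. The value columns left_val and right_val are illustrative.

import pandas as pd

left = pd.DataFrame({'key': [1, 1, 3, 5, 5],
                     'left_val': ['a', 'b', 'c', 'd', 'e']})
right = pd.DataFrame({'key': [1, 3, 5],
                      'right_val': ['x', 'y', 'z']})

# Each repeated key on the left picks up its single matching row from the right
result = pd.merge(left, right, on='key', how='inner')
print(result)

#    key left_val right_val
# 0    1        a         x
# 1    1        b         x
# 2    3        c         y
# 3    5        d         z
# 4    5        e         z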
1.10.2 Pandas join(): Combining Data on a Column or Index
• While merge() is a module-level function, .join() is an instance method that lives on the DataFrame. This enables one to specify only one DataFrame, which will be joined to the DataFrame that .join() is called on.
• Under the hood, .join() uses merge(), but it provides a more efficient way to join DataFrames than a fully specified merge() call.

Using join()
• By default, join() will attempt to do a left join on indices. If one wants to join on columns, similar to merge(), then one needs to set the columns as indices.
• Like merge(), .join() has a few parameters that give more flexibility in the join operation. However, with join() the list of parameters is relatively short (a short .join() sketch appears at the end of this section):
  ○ other is the only required parameter. It defines the other DataFrame to join. One can also specify a list of DataFrames here, allowing one to combine a number of datasets in a single join() call.
  ○ on specifies an optional column or index name for the left DataFrame to join on the other DataFrame's index. If it's set to None, which is the default, the result will be an index-on-index join.
  ○ how has the same options as how from merge(). The difference is that the join is index-based unless the columns are also specified.
  ○ lsuffix and rsuffix are similar to suffixes in merge(). They specify a suffix to add to any overlapping columns, but have no effect when passing a list of other DataFrames.
  ○ sort can be enabled to sort the resulting DataFrame by the join key.

Example Program - 1

import pandas as pd

left = pd.DataFrame({'id': [1, 2, 3], 'Name': ['Siya', 'Riya', 'Maya']})
right = pd.DataFrame({'id': [1, 2, 4], 'Marks': [85, 90, 75]})

result = pd.merge(left, right, on='id')   # inner join on the common 'id' column
print(result)

Program Output - 1

   id  Name  Marks
0   1  Siya     85
1   2  Riya     90

Concatenating DataFrames - concat()
• pandas concat() takes a list of Series or DataFrame objects to be concatenated. The axis argument decides the direction of concatenation: the default value of axis is 0, which will concatenate along the index (one frame below the other); a value of 1 will concatenate along the columns (side by side).

Example Program - 2

import pandas as pd

df1 = pd.DataFrame({'id': ['T01', 'T02', 'T03', 'T04'],
                    'Name': ['Siya', 'Riya', 'Maya', 'Piya']})
df2 = pd.DataFrame({'id': ['R05', 'R06'],
                    'Name': ['Raam', 'Raaj']})

frames = [df1, df2]
result = pd.concat(frames)
print(result)

Program Output - 2

    id  Name
0  T01  Siya
1  T02  Riya
2  T03  Maya
3  T04  Piya
0  R05  Raam
1  R06  Raaj
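For the .join() method described above, a minimal sketch that follows the naming style of Example Program 2; the frames and the marks are illustrative.

import pandas as pd

left = pd.DataFrame({'Name': ['Siya', 'Riya']}, index=['T01', 'T02'])
right = pd.DataFrame({'Marks': [85, 90]}, index=['T01', 'T02'])

# .join() is called on the left frame and joins on the indices by default
print(left.join(right))

#      Name  Marks
# T01  Siya     85
# T02  Riya     90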
1.11 Reshaping and Pivoting

1.11.1 Joining and Splitting Data - melt() and pivot()
• In Pandas, data reshaping means the transformation of the structure of a table or vector (i.e. DataFrame or Series) to make it suitable for further analysis. Some of Pandas' reshaping capabilities do not readily exist in other environments (e.g. SQL or bare-bones R).
• Pandas has two methods, melt() and pivot(), to reshape the data.

melt()
• This method flattens/melts tabular data such that the specified columns and their respective values are transformed into key-value pairs. 'Keys' are the column names of the dataset before transformation and 'values' are the values in the respective columns. After transformation, keys are stored in a column named 'variable' and values are stored in another column named 'value', by default. The columns of the data frame are transformed into key-value pairs.
• melt() unpivots a DataFrame from wide format to long format, optionally leaving identifier variables set. This function is useful to massage a DataFrame into a format where one or more columns are identifier variables (id_vars), while all other columns, considered measured variables (value_vars), are "unpivoted" to the row axis, leaving just two non-identifier columns, 'variable' and 'value'.

Syntax
pandas.melt(frame, id_vars=None, value_vars=None, var_name=None, value_name='value', col_level=None)

Parameters:
frame : DataFrame (contains lists, numbers, strings) - Required.
id_vars : tuple, list or ndarray - Optional. Column(s) to use as identifier variables.
value_vars : tuple, list or ndarray - Optional. Column(s) to unpivot. If not specified, uses all columns that are not set as id_vars.
var_name : scalar - Optional. Name to use for the 'variable' column. If None, it uses frame.columns.name or 'variable'.
value_name : scalar, default 'value' - Optional. Name to use for the 'value' column.
col_level : int or string - Optional. If columns are a MultiIndex, then use this level to melt.

Returns: Unpivoted DataFrame.

Example Program - 3

import pandas as pd

df = pd.DataFrame({'Name': ['Ravi', 'Anil', 'Anne'],
                   'Role': ['CEO', 'Editor', 'Author']})
print(pd.melt(df))

Program Output - 3

  variable   value
0     Name    Ravi
1     Name    Anil
2     Name    Anne
3     Role     CEO
4     Role  Editor
5     Role  Author

pivot()
• This method does the reverse of what melt() did. It transforms the key-value pairs into columns.
• pivot() reshapes data (produces a "pivot" table) based on column values. It uses unique values from the specified index/columns to form the axes of the resulting DataFrame. This function does not support data aggregation; multiple values will result in a MultiIndex in the columns. It returns a reshaped DataFrame organized by the given index/column values.

Syntax
pandas.pivot(data, index=None, columns=None, values=None)

Parameters:
data : DataFrame - Required.
index : string or object - Optional. Column to use to make the new frame's index. If None, uses the existing index.
columns : string or object - Required. Column to use to make the new frame's columns.
values : string, object or a list of the previous - Optional. Column(s) to use for populating the new frame's values. If not specified, all remaining columns will be used and the result will have hierarchically indexed columns.

Returns: Reshaped DataFrame.
Raises: ValueError, when there are any index/column combinations with multiple values.
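A minimal sketch showing pivot() as the inverse of the melt() example above. A hypothetical id column is added so that each key-value pair can be placed on its own row of the wide result.

import pandas as pd

df = pd.DataFrame({
    'id':       [1, 1, 2, 2],
    'variable': ['Name', 'Role', 'Name', 'Role'],
    'value':    ['Ravi', 'CEO', 'Anil', 'Editor'],
})

# pivot() spreads the key-value pairs back out into columns
wide = df.pivot(index='id', columns='variable', values='value')
print(wide)

# variable  Name    Role
# id
# 1         Ravi     CEO
# 2         Anil  Editor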
1.11.2 Transformation Techniques

1. Grouping datasets
• Pandas groupby is used for grouping the data according to categories and applying a function to each category. It also helps to aggregate data efficiently.
• The Pandas groupby() function is used to split the data into groups based on some criteria. Pandas objects can be split on any of their axes. The abstract definition of grouping is to provide a mapping of labels to group names.
• groupby() is a very powerful function with a lot of variations. It makes the task of splitting the dataframe over some criteria really easy and efficient.
• Any groupby operation involves one of the following operations on the original object:
  ○ Splitting the object
  ○ Applying a function
  ○ Combining the results.

Syntax
DataFrame.groupby(by=None, axis=0, level=None, as_index=True, sort=True, group_keys=True, squeeze=False, **kwargs)

Parameters:
• by : mapping, function, str or iterable.
• axis : int, default 0.
• level : If the axis is a MultiIndex (hierarchical), group by a particular level or levels.
• as_index : For aggregated output, return an object with group labels as the index. Only relevant for DataFrame input. as_index=False is effectively "SQL-style" grouped output.
• sort : Sort group keys. Get better performance by turning this off. Note this does not influence the order of observations within each group; groupby preserves the order of rows within each group.
• group_keys : When calling apply, add group keys to the index to identify pieces.
• squeeze : Reduce the dimensionality of the return type if possible.

Returns: GroupBy object.

• In many situations, the data is split into sets and some functionality is applied on each subset. In the applied functionality, one can perform the following operations:
  ○ Aggregation - computing a summary statistic
  ○ Transformation - performing some group-specific operation
  ○ Filtration - discarding the data subject to some condition.

2. Split data into groups
• Pandas objects can be split on any of their axes. There are multiple ways to split an object, like:
  ○ obj.groupby('key')
  ○ obj.groupby(['key1', 'key2'])
  ○ obj.groupby(key, axis=1)

3. Data aggregations
• An aggregated function returns a single aggregated value for each group. Once the group-by object is created, several aggregation operations can be performed on the grouped data.
• Applying multiple aggregation functions at once: with a grouped Series, one can also pass a list or dict of functions to aggregate with, and generate a DataFrame as output.

4. Transformations
• A transformation on a group or a column returns an object that is indexed the same size as the one being grouped. Thus, the transform should return a result that is the same size as that of a group chunk (a short transform() sketch appears after the example program below).

5. Filtration
• Filtration filters the data on a defined criterion and returns the subset of data. The filter() function is used to filter the data.

Example Program - 5

import numpy as np
import pandas as pd

ipl_data = {
    'Team': ['Warriors', 'Fighters', 'Supers', 'Fighters', 'Warriors', 'Fighters',
             'Fighters', 'Supers', 'Supers', 'Fighters', 'Warriors', 'Supers'],
    'Rank': [1, 2, 2, 3, 3, 4, 1, 1, 2, 4, 1, 2],
    'Year': [2004, 2005, 2004, 2005, 2004, 2005, 2006, 2007, 2008, 2004, 2005, 2007],
    'Points': [876, 789, 863, 673, 741, 812, 756, 788, 694, 701, 804, 690]
}
df = pd.DataFrame(ipl_data)
print(df)

# view the groups
print(df.groupby('Team').groups)

# iterating through groups
grouped = df.groupby('Year')
for name, group in grouped:
    print(name)
    print(group)

# grouping and then applying an aggregate function
grouped = df.groupby('Year')
print(grouped['Points'].agg(np.mean))

# applying multiple aggregate functions
grouped = df.groupby('Team')
print(grouped['Points'].agg([np.sum, np.mean, np.std]))

# filtration: return the teams which have participated four or more times
print(df.groupby('Team').filter(lambda x: len(x) >= 4))

Program Output - 5

        Team  Rank  Year  Points
0   Warriors     1  2004     876
1   Fighters     2  2005     789
2     Supers     2  2004     863
3   Fighters     3  2005     673
4   Warriors     3  2004     741
5   Fighters     4  2005     812
6   Fighters     1  2006     756
7     Supers     1  2007     788
8     Supers     2  2008     694
9   Fighters     4  2004     701
10  Warriors     1  2005     804
11    Supers     2  2007     690

{'Fighters': [1, 3, 5, 6, 9], 'Supers': [2, 7, 8, 11], 'Warriors': [0, 4, 10]}

2004
       Team  Rank  Year  Points
0  Warriors     1  2004     876
2    Supers     2  2004     863
4  Warriors     3  2004     741
9  Fighters     4  2004     701
2005
        Team  Rank  Year  Points
1   Fighters     2  2005     789
3   Fighters     3  2005     673
5   Fighters     4  2005     812
10  Warriors     1  2005     804
2006
       Team  Rank  Year  Points
6  Fighters     1  2006     756
2007
      Team  Rank  Year  Points
7   Supers     1  2007     788
11  Supers     2  2007     690
2008
     Team  Rank  Year  Points
8  Supers     2  2008     694

Year
2004    795.25
2005    769.50
2006    756.00
2007    739.00
2008    694.00
Name: Points, dtype: float64

           sum    mean        std
Team
Fighters  3731  746.20  58.435434
Supers    3035  758.75  82.951291
Warriors  2421  807.00  67.549983

        Team  Rank  Year  Points
1   Fighters     2  2005     789
2     Supers     2  2004     863
3   Fighters     3  2005     673
5   Fighters     4  2005     812
6   Fighters     1  2006     756
7     Supers     1  2007     788
8     Supers     2  2008     694
9   Fighters     4  2004     701
11    Supers     2  2007     690
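The size-preserving behaviour of transformations described above can be seen in a minimal sketch; the frame and column names here are illustrative, not from Example Program 5.

import pandas as pd

df = pd.DataFrame({'Team': ['A', 'A', 'B', 'B'],
                   'Points': [10, 20, 30, 50]})

# transform() returns a result aligned row-for-row with the original frame:
# here every point is centred on its own team's mean
df['centered'] = df.groupby('Team')['Points'].transform(lambda x: x - x.mean())
print(df)

#   Team  Points  centered
# 0    A      10      -5.0
# 1    A      20       5.0
# 2    B      30     -10.0
# 3    B      50      10.0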
6. Pivot tables and cross tabulations
• A pivot table is a table of statistics that helps summarize the data of a larger table by "pivoting" that data. Microsoft Excel popularized the pivot table, where they are known as PivotTables.
• A pivot table in pandas is a tool to summarize one or more numeric variables based on two other categorical variables. Pandas gives access to creating pivot tables in Python using the pivot_table() function.

Syntax
pandas.pivot_table(data, values=None, index=None, columns=None, aggfunc='mean', fill_value=None, margins=False, dropna=True, margins_name='All')

• It creates a spreadsheet-style pivot table as a DataFrame. Levels in the pivot table will be stored in MultiIndex objects (hierarchical indexes) on the index and columns of the result DataFrame.

Parameters:
• data : DataFrame.
• values : Column to aggregate, optional.
• index : Column, Grouper, array or list of the previous.
• columns : Column, Grouper, array or list of the previous.
• aggfunc : Function, list of functions, dict; default numpy.mean.
  ○ If a list of functions is passed, the resulting pivot table will have hierarchical columns whose top level are the function names.
  ○ If a dict is passed, the key is the column to aggregate and the value is the function or list of functions.
• fill_value [scalar, default None] : Value to replace missing values with.
• margins [boolean, default False] : Add all rows/columns (e.g. for subtotals/grand totals).
• dropna [boolean, default True] : Do not include columns whose entries are all NaN.
• margins_name [string, default 'All'] : Name of the row/column that will contain the totals when margins is True.

Returns: DataFrame.

Example Program - 6

import numpy as np
import pandas as pd

employees = pd.DataFrame({
    'First Name': ['Ravi', 'Raj', 'Sita', 'Rita', 'Amit'],
    'Last Name': ['Verma', 'Kumar', 'Pande', 'Sharma', 'Singh'],
    'Department': ['Administration', 'Technical', 'Administration',
                   'Management', 'Technical'],
    'Salary Type': ['Full-time Employee', 'Full-time Employee', 'Part-time Employee',
                    'Full-time Employee', 'Part-time Employee'],
    'Salary': [15000, 20000, 10000, 15000, 10000]
})

table = pd.pivot_table(employees, index='Salary Type', values='Salary',
                       aggfunc=[np.sum, np.mean, 'count'])
print(table)

Program Output - 6

                       sum          mean  count
                    Salary        Salary Salary
Salary Type
Full-time Employee   50000  16666.666667      3
Part-time Employee   20000  10000.000000      2
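The cross tabulations mentioned in this section's title can be produced with pandas' crosstab() function, which counts how often combinations of two categorical variables occur. A minimal sketch reusing the illustrative employees frame from Example Program 6:

import pandas as pd

# employees is the frame built in Example Program 6
print(pd.crosstab(employees['Department'], employees['Salary Type']))

# Salary Type     Full-time Employee  Part-time Employee
# Department
# Administration                   1                   1
# Management                       1                   0
# Technical                        1                   1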
One can also easily spot the outliers, which are usually treated abnormal values and affect the data set’s overall observation due to their very high or low values. Box plots are suitable for identifying outliers. a Bar charts can be used to compare nominal or ordinal data. They are helpful for recognizit trends. 4 Q.5. Write a Python program to demonstrate use merge() function.” Ans. : Refer example program 1. TECHNICAL PUBLICATIONS® - an up-thrust for knowledge"UNIT II Importing Matplotib - Simple line plots - Simple scatter plots - visualizing errors - density and contour plots - Histograms - legends - colors - subplots - text and annotation = customization - three dimensional plotting - Geographic Data with Basemap ~ Visualization with Seaborn. Contents Importing Matplotlib Pyplot Library and Plot Function Visualizing Errors 24 Scatter Plot g 2.5 Customizing Markers in Scatter Plots 2.6 Adding Legend in Scatter Plot Customizing the Colormap, Style and Legends Contour Plots Density Plots. Histograms Subplots 2.12 Three Dimensional Plotting 2.13 Geographic Data with Base Map 2.14. Visualization with Seaborn Two Marks Questions with AnswersData Exploration and Visualization (2-2) Visualizing using Matploty, 2.1 Importing Matplotiib ‘© Data visualisation means graphical or pictorial etc, The purpose of plotting data is to visual variables. Visualisation also helps to effectively communicate information to intended users, T; symbols, ultrasound reports, Atlas book of maps, speedometer of a vehicle, tuners op instruments are few examples of visualisation that one comes across in daily lives, ‘Visualisation of data is effectively used in fields like health, finance, science, ‘mathematics, engineering, etc. ‘This units covers how to visualise data using Matplotlib library of Python by plotting charts such as line, bar, scatter with respect to the various types of data i Matplotlib is one of the most popular Python packages used for data visualization, It is across-platform library for making 2D plots from data in arrays. Matplotlib jg written in Python and makes use of NumPy, the numerical mathematics extension of Python. It provides an object-oriented API that helps in embedding plots in applications using Python GUI toolkits such as PyQt, WxPythonotTkinter, 4 It can be used in Python and IPython shells, Jupyter notebook and web application servers also. Matplotlib has a procedural interface named the Pylab, which ig designed to resemble MATLAB, a proprietary programming language developed by MathWorks. Matplotlib alongwith NumPy can be considered as the open source: equivalent of MATLAB. Matplotlib was originally written by John D. Hunter in 2003. The current stable version is 3.6.0 released in 2022. i ‘Installation setup and installation verification for Matplotlib library © To check Python vers ymmand I representation of the data using graph, g ise variation or show relationships between TECHNICAL PUBLICATIONS® - an up-thrust for knowledge.EE TN PT OE TET ET pata Exploration and Visualization (9. Visualizing using Matploti a 2.2 Pyplot Library and Plot Function «The pyplot, a sublibrary of matplotlib, is a collection of functions that helps in creating a variety of charts. Each pyplot function makes some change to a figure: e.g, creates a figure, creates a plotting area in a figure, plots some lines in a plotting area, decorates the plot with labels, etc. 
“ In matplotlib.pyplot various states are preserved across function calls, so that it keeps track of things like the current figure and plotting area and the plotting functions are directed to the current axes (it should be noted that "axes" here and in most places in the documentation refers to the axes part of a figure and not the strict mathematical term for more than one axis). Plot function JFL satnloibspypiok plore Soalee= Tr, soalay= True, data=None,“"wargs) Plot y versus x lines and/or markers.args is a variable length argument, allowing for multiple x, y pairs with an optional format string. « Below are some of the plot() function signature. GaSe -None,*#iwasgs) MT Sg EEO [ive por(ae,(t [x2] v2, Ltt... aie) The horizontal / vertical coordinates of the data points. x values are opicaal and default to range(len(y)). Commonly, these parameters are 1D arrays. They can also be scalars, or two-dimensional (in that case, the columns represent separate ibe bs weet © A format string, e.g. 'to' for red circles. See the Notes section for a full 1 description of the format strings. Format strings are just an abbreviation for quickly setting basic line properties, All of these and more can also be controlled by keyword arguments, This argument cannot be passed as keyword. TECHNICAL PUBLICATIONS® - an up-thrust for knowfadge \Data Exploration and Visualization 2-4 Visualizing using Mey Format Strings © A format string consists of part for color, marker and line _ © Each of them is optional. If not provided, the value from the style cycle is used, Exception If line is given, but no marker, th ch as [colr}{marker[ine] are also supported, but note that ig .¢ data will be a line without markers. © Other combinations su parsing may be ambiguous. Markers point marker i pixel marker ‘o' circle marker © ‘vi triangle_down marker Cy triangle_up marker ' triangle_left marker > | triangle_right marker AY tri_down marker 2 tri_up marker 3 tri_left marker 4 tri_right marker 8 octagon marker v i : s square marker pentagon marker dg plus (filled) marker ; ee star marker ty’ __| hexagon! marker H hexagon2 marker TECHNICAL PUBLICATIONS® - an up-hrust for knowledge