Module 1: Data Visualization and Data Exploration: Prepared by Dr. Ganesha Prasad, Dept. of AI & ML
Module 1 covers Data Visualization and Exploration, emphasizing the importance of visual representation of data through various tools and libraries. It discusses statistical concepts, operations using Numpy and Pandas, and the advantages and disadvantages of data visualization. Additionally, it highlights applications in business intelligence, financial analysis, and healthcare, along with data wrangling techniques to prepare data for visualization.
Module 1: Data Visualization and Data Exploration
• Introduction: Data Visualization, Importance of Data Visualization, Data Wrangling, Tools and Libraries for Visualization
• Overview of Statistics: Measures of Central Tendency, Measures of Dispersion, Correlation, Types of Data, Summary Statistics
• NumPy: NumPy Operations - Indexing, Slicing, Splitting, Iterating, Filtering, Sorting, Combining, and Reshaping
• Pandas: Advantages of pandas over NumPy, Disadvantages of pandas, Pandas operations - Indexing, Slicing, Iterating, Filtering, Sorting and Reshaping using Pandas

Data Visualization
• Data visualization is the graphical representation of information and data. By using visual elements like charts, graphs, and maps, data visualization tools provide an accessible way to see and understand trends, outliers, and patterns in data.
• Data visualization translates complex data sets into visual formats that are easier for the human brain to comprehend. This can include a variety of visual tools such as:
• Charts: Bar charts, line charts, pie charts, etc.
• Graphs: Scatter plots, histograms, etc.
• Maps: Geographic maps, heat maps, etc.

Tools for Visualization of Data
• The following are 10 of the best data visualization tools:
1. Tableau
2. Looker
3. Zoho Analytics
4. Sisense
5. IBM Cognos Analytics
6. Qlik Sense
7. Domo
8. Microsoft Power BI
9. Klipfolio
10. SAP Analytics Cloud

Advantages of Data Visualization:
• Enhanced Comparison: Visualizing the performance of two elements or scenarios streamlines analysis, saving time compared to traditional data examination.
• Improved Methodology: Representing data graphically offers a superior understanding of situations, exemplified by tools like Google Trends illustrating industry trends in graphical form.
• Efficient Data Sharing: Visual data presentation facilitates effective communication, making information more digestible and engaging compared to sharing raw data.
• Sales Analysis: Data visualization aids sales professionals in comprehending product sales trends, identifying influencing factors through tools like heat maps, and understanding customer types, geographic impacts, and repeat-customer behaviors.
• Identifying Event Relations: Discovering correlations between events helps businesses understand external factors affecting their performance, such as online sales surges during festive seasons.
• Exploring Opportunities and Trends: Data visualization empowers business leaders to uncover patterns and opportunities within vast datasets, enabling a deeper understanding of customer behaviors and insights into emerging business trends.

Disadvantages of Data Visualization:
• Can be time-consuming: Creating visualizations can be a time-consuming process, especially when dealing with large and complex datasets.
• Can be misleading: While data visualization can help identify patterns and relationships in data, it can also be misleading if not done correctly. Visualizations can create the impression of patterns or trends that do not exist, leading to incorrect conclusions and poor decision-making.
• Can be difficult to interpret: Some types of visualizations, such as those that involve 3D or interactive elements, can be difficult to interpret and understand.
• May not be suitable for all types of data: Certain types of data, such as text or audio data, may not lend themselves well to visualization. In these cases, alternative methods of analysis may be more appropriate.
• May not be accessible to all users: Some users may have visual impairments or other disabilities that make it difficult or impossible for them to interpret visualizations.
In these cases, alternative methods of presenting data may be necessary to ensure accessibility.

Applications of Data Visualization
• Business Intelligence and Reporting
• Financial Analysis
• Healthcare
• Marketing and Sales
• Human Resources
Data Wrangling
• Data wrangling is the process of transforming raw data into a suitable representation for various tasks. It is the discipline of augmenting, cleaning, filtering, standardizing, and enriching data in a way that allows it to be used in a downstream task, which in our case is data visualization.
• Data wrangling is also known as data munging.
• Example: A book-selling website wants to show the top-selling books of different domains according to user preference. For example, if a new user searches for motivational books, the site wants to show the motivational books that sell the most or have high ratings.
• But on the website there is plenty of raw data from different users. Here the concept of data munging, or data wrangling, is used. Data wrangling is not done by the system itself; it is done by data scientists. A data scientist will wrangle the data in such a way that it surfaces the motivational books that are sold the most, have high ratings, or are frequently bought together with other books. On that basis, the new user can make a choice. This illustrates the importance of data wrangling.

Tools and Libraries for Visualization
Commonly used tools are:
• Non-coding tools – Tableau, Power BI
• Coding tools – Python, MATLAB, and R
Note: these libraries will be discussed in detail in the coming chapters.

Overview of Statistics
• Statistics is a combination of the analysis, collection, interpretation, and representation of numerical data.
• Probability is a measure of the likelihood that an event will occur and is quantified as a number between 0 and 1.
• A probability distribution is a function that provides the probability of every possible event. A probability distribution is frequently used for statistical analysis. The higher the probability, the more likely the event. There are two types of probability distributions: discrete and continuous.
Discrete probability distribution
• A discrete probability distribution shows all the values that a random variable can take, together with their probabilities. For example, if we have a six-sided die, we can roll each number between 1 and 6. We have six events that can occur based on the number that is rolled. There is an equal probability of rolling any of the numbers, so the individual probability of any of the six events occurring is 1/6.

Continuous probability distribution
• A continuous probability distribution defines the probabilities of each possible value of a continuous random variable. An example is the distribution of the time needed to drive home: in most cases, around 60 minutes is needed, but sometimes less time is needed because there is no traffic, and sometimes much more time is needed if there are traffic jams.

Measures of Central Tendency
• Mean: The arithmetic average is computed by summing up all measurements and dividing the sum by the number of observations. For n observations x1, x2, ..., xn, the mean is (x1 + x2 + ... + xn) / n.
• Median: This is the middle value of the ordered dataset. If there is an even number of observations, the median is the average of the two middle values. The median is less prone to outliers than the mean, where outliers are extreme values that are distinct from the rest of the data.
• Mode: Our last measure of central tendency, the mode, is defined as the most frequent value. There may be more than one mode in cases where multiple values are equally frequent.

Example
• A die was rolled 10 times, and we got the following numbers: 4, 5, 4, 3, 4, 2, 1, 1, 2, and 1.
The mean is calculated by summing all the events and dividing the sum by the number of observations: (4+5+4+3+4+2+1+1+2+1)/10 = 2.7.
To calculate the median, the die rolls have to be ordered according
to their values. The ordered values are as follows: 1, 1, 1, 2, 2, 3, 4, 4, 4, 5. Since we have an even number of die rolls, we need to take the average of the two middle values. The average of the two middle values is (2+3)/2=2.5.
The modes are 1 and 4, since they are the two most frequent events.

Measures of Dispersion
• Dispersion, also called variability, is the extent to which a probability distribution is stretched or squeezed. Common measures of dispersion are the range, the variance, and the standard deviation.
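A minimal sketch of these central-tendency and dispersion calculations in Python, using NumPy and the standard library's `statistics` module (the `rolls` data is the die-roll example above; the variable names are my own):

```python
import numpy as np
from statistics import multimode

rolls = [4, 5, 4, 3, 4, 2, 1, 1, 2, 1]

mean = np.mean(rolls)              # (4+5+4+3+4+2+1+1+2+1)/10 = 2.7
median = np.median(rolls)          # average of the two middle values: (2+3)/2 = 2.5
modes = sorted(multimode(rolls))   # [1, 4] — both occur three times

# Two common measures of dispersion:
variance = np.var(rolls)           # population variance
std_dev = np.std(rolls)            # population standard deviation (sqrt of variance)
```

Note that `np.var` and `np.std` compute the *population* variance and standard deviation by default; pass `ddof=1` for the sample versions.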
Correlation
Correlation describes the statistical relationship between two variables:
• In a positive correlation, both variables move in the same direction.
• In a negative correlation, the variables move in opposite directions.
• With zero correlation, there is no linear relationship between the variables.
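A small sketch of the positive and negative cases using NumPy's `np.corrcoef`, with made-up data:

```python
import numpy as np

hours = np.array([1, 2, 3, 4, 5])
score = np.array([52, 58, 65, 71, 80])   # rises with hours: positive correlation

r = np.corrcoef(hours, score)[0, 1]      # Pearson correlation coefficient, close to +1

# Negating one variable flips the direction of the relationship
r_neg = np.corrcoef(hours, -score)[0, 1] # close to -1
```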
Types of Data
NumPy
• NumPy provides support for large n-dimensional arrays and has built-in support for many high-level mathematical and statistical operations, such as the mean, median, variance, and standard deviation.
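The original slides show these operations as screenshots; a rough code equivalent, with illustrative data of my own:

```python
import numpy as np

data = np.array([[1, 2, 3],
                 [4, 5, 6]])

np.mean(data)           # mean of all elements: 3.5
np.mean(data, axis=0)   # per-column means: [2.5, 3.5, 4.5]
np.median(data)         # 3.5
np.var(data)            # population variance
np.std(data)            # population standard deviation
```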
Basic NumPy Operations Indexing • Indexing elements in a NumPy array, at a high level, works the same as with built-in Python lists. Therefore, we can index elements in multi-dimensional matrices:
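For illustration, here is indexing on a small 3x3 matrix (the data is made up):

```python
import numpy as np

matrix = np.array([[1, 2, 3],
                   [4, 5, 6],
                   [7, 8, 9]])

matrix[0]       # first row: [1, 2, 3]
matrix[1][2]    # chained, list-style indexing: 6
matrix[1, 2]    # NumPy's comma syntax selects the same element: 6
matrix[-1, -1]  # negative indices work as in Python lists: 9
```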
• Slicing: Being able to easily slice parts of an ndarray into new ndarrays is very helpful when handling large amounts of data.
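A short sketch of slicing, using an illustrative 4x4 dataset:

```python
import numpy as np

dataset = np.arange(16).reshape(4, 4)   # 4x4 matrix of the values 0..15

dataset[1:3]        # rows 1 and 2
dataset[:, 1:3]     # columns 1 and 2 of every row
dataset[1:3, 1:3]   # the inner 2x2 block: [[5, 6], [9, 10]]
dataset[::2]        # every second row
```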
Splitting • Splitting data can be helpful in many situations, from plotting only half of your time-series data to separating test and training data for machine learning algorithms. There are two ways of splitting your data: horizontally and vertically. Horizontal splitting can be done with the hsplit method; vertical splitting can be done with the vsplit method.
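A minimal example of both methods on a 4x4 dataset (data is illustrative):

```python
import numpy as np

dataset = np.arange(16).reshape(4, 4)

left, right = np.hsplit(dataset, 2)   # horizontal split: 2 equal blocks of columns
top, bottom = np.vsplit(dataset, 2)   # vertical split: 2 equal blocks of rows

top.shape    # (2, 4) — the first two rows
left.shape   # (4, 2) — the first two columns
```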
Iterating • Iterating over NumPy's data structures, ndarrays, is also possible. It steps over the whole array, visiting every single element once. Considering that ndarrays can have several dimensions, indexing gets complex; nditer is a multi-dimensional iterator object that iterates over a given array. If we also need the index of each element, ndenumerate will give us exactly that.
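A small sketch of both iterators on a 2x2 array (data is illustrative):

```python
import numpy as np

dataset = np.array([[1, 2],
                    [3, 4]])

# nditer visits every element once, regardless of dimensionality
values = [int(v) for v in np.nditer(dataset)]   # [1, 2, 3, 4]

# ndenumerate also yields each element's multi-dimensional index
pairs = list(np.ndenumerate(dataset))           # [((0, 0), 1), ((0, 1), 2), ...]
```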
Filtering • Filtering is a very powerful tool that can be used to clean up your data if you want to avoid outlier values.
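A minimal sketch of boolean-mask filtering, with made-up data where large values stand in for outliers:

```python
import numpy as np

dataset = np.array([1, 105, 3, 98, 4, 230, 2])

# Boolean mask: keep only values below 100, dropping suspected outliers
cleaned = dataset[dataset < 100]          # [1, 3, 98, 4, 2]

# np.extract expresses the same filter as a function call
same = np.extract(dataset < 100, dataset)
```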
Sorting • Sorting each row of a dataset can be really useful. Using NumPy, we are also able to sort on other dimensions, such as columns. • In addition, argsort gives us the possibility to get a list of indices, which would result in a sorted list:
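A short sketch of sorting along different axes and of argsort (data is illustrative):

```python
import numpy as np

dataset = np.array([[3, 1, 2],
                    [9, 7, 8]])

np.sort(dataset)          # sorts along the last axis (each row): [[1, 2, 3], [7, 8, 9]]
np.sort(dataset, axis=0)  # sorts each column instead
np.argsort(dataset[0])    # indices that would sort the first row: [1, 2, 0]
```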
Combining • Stacking rows and columns onto an existing dataset can be helpful when you have two datasets of the same dimension saved to different files. • Given two datasets, we use vstack to stack dataset_1 on top of dataset_2, which will give us a combined dataset with all the rows from dataset_1, followed by all the rows from dataset_2. • If we use hstack, we stack our datasets "next to each other," meaning that the elements from the first row of dataset_1 will be followed by the elements of the first row of dataset_2. This will be applied to each row:
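The vstack/hstack behavior described above can be sketched with two small made-up datasets:

```python
import numpy as np

dataset_1 = np.array([[1, 2],
                      [3, 4]])
dataset_2 = np.array([[5, 6],
                      [7, 8]])

np.vstack((dataset_1, dataset_2))  # 4x2: rows of dataset_1, then rows of dataset_2
np.hstack((dataset_1, dataset_2))  # 2x4: each row of dataset_1 extended by dataset_2's row
```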
Reshaping • Reshaping can be crucial for some algorithms. Depending on the nature of your data, it might help you to reduce dimensionality to make visualization easier.
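A minimal sketch of reshape and its inverse (data is illustrative):

```python
import numpy as np

dataset = np.arange(6)          # [0, 1, 2, 3, 4, 5]

dataset.reshape(2, 3)           # 2 rows, 3 columns
dataset.reshape(3, -1)          # -1 lets NumPy infer the other dimension: 3x2
dataset.reshape(2, 3).ravel()   # flatten back to one dimension
```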
Pandas • The pandas Python library provides data structures and methods for manipulating different types of data, such as numerical and temporal data. These operations are easy to use and highly optimized for performance. Data formats such as CSV and JSON, as well as databases, can be used to create DataFrames. • DataFrames are the internal representations of data and are very similar to tables, but are more powerful since they allow you to efficiently apply operations such as multiplications, aggregations, and even joins. Importing and reading both files and in-memory data is abstracted into a user-friendly interface. When it comes to handling missing data, pandas provides built-in solutions to clean up and augment your data, meaning it fills in missing values with reasonable values.
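A minimal sketch of creating a DataFrame, here from an in-memory dict rather than a file (`pd.read_csv` and `pd.read_json` produce DataFrames the same way); the book data is made up:

```python
import pandas as pd

df = pd.DataFrame({
    "title":  ["Book A", "Book B", "Book C"],
    "rating": [4.5, 3.9, 4.8],
    "sold":   [120, 80, 200],
})

df["rating"].mean()   # column aggregation: (4.5 + 3.9 + 4.8) / 3
df["sold"].sum()      # 400
```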
• Integrated indexing and label-based slicing in combination with fancy indexing (what we already saw with NumPy) make handling data simple. More complex techniques, such as reshaping, pivoting, and melting data, together with the possibility of easily joining and merging data, provide powerful tooling so that you can handle your data correctly.
Basic Operations of pandas
Indexing:
• Indexing with pandas is a bit more complex than with NumPy. We can only access columns with a single bracket. To access rows by their integer position, we need the iloc method. If we want to access rows by their label (the index_col that was set in the read_csv call), we need to use the loc method.
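A sketch of the three access styles, using a hypothetical in-memory dataset as a stand-in for the slides' `read_csv(..., index_col=...)` call:

```python
import pandas as pd

# Hypothetical stand-in for pd.read_csv("...", index_col="country")
df = pd.DataFrame(
    {"population": [83, 67, 60]},
    index=["Germany", "France", "Italy"],
)

df["population"]                 # a single bracket selects a column
df.iloc[0]                       # row by integer position (Germany)
df.loc["France"]                 # row by index label
df.loc["France", "population"]   # a single value: 67
```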
Series
• A pandas Series is a one-dimensional labeled array that is capable of holding any type of data. We can create a Series by loading datasets from a .csv file, Excel spreadsheet, or SQL database.
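A Series can also be built directly from in-memory data; a minimal sketch with made-up values:

```python
import pandas as pd

# Labels default to 0, 1, 2, ...
s = pd.Series([10, 20, 30], name="sales")

# With explicit labels, a Series behaves like a labeled 1-D array
monthly = pd.Series([10, 20, 30], index=["Jan", "Feb", "Mar"])
monthly["Feb"]   # 20
```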
Advanced pandas Operations
Filtering
• Filtering in pandas has a higher-level interface than NumPy. You can still use simple bracket-based conditional filtering. However, you are also able to use more complex queries, for example, filtering rows or columns based on label likeness, which allows us to search for a substring using the like argument, and even full regular expressions using the regex argument.
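A sketch of all three styles on a hypothetical dataset with two "score" columns:

```python
import pandas as pd

df = pd.DataFrame(
    {"overall score": [7.5, 6.2, 8.1], "quality score": [8.0, 5.9, 7.7]},
    index=["Finland", "Greece", "Iceland"],
)

df[df["overall score"] > 7]       # bracket-based conditional filtering on rows
df.filter(like="score", axis=1)   # columns whose label contains the substring "score"
df.filter(regex="^qual", axis=1)  # columns whose label matches a regular expression
```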
Sorting • Sorting a dataset based on a given row or column helps you analyze your data and find the ranking of entries. With pandas, we can do this easily. Sorting in ascending or descending order is controlled by the ascending parameter; the default order is ascending. More complex sorting can be done by providing more than one column in the by = [ ] list: subsequent columns are used to sort values for which the earlier columns are equal.
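A minimal sketch of sort_values, including tie-breaking with a second column (data is made up):

```python
import pandas as pd

df = pd.DataFrame({
    "country": ["A", "B", "C", "D"],
    "score":   [7.0, 7.0, 6.5, 8.2],
    "rank":    [2, 1, 4, 3],
})

df.sort_values(by="score")                   # ascending is the default order
df.sort_values(by="score", ascending=False)  # descending order
df.sort_values(by=["score", "rank"])         # ties in "score" are broken by "rank"
```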
Reshaping • Reshaping can be crucial for easier visualization and algorithms. However, depending on your data, this can get really complex:
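A small sketch of two common pandas reshaping operations, pivot (long to wide) and melt (wide back to long), on a made-up sales dataset:

```python
import pandas as pd

df = pd.DataFrame({
    "city":  ["Oslo", "Oslo", "Bergen", "Bergen"],
    "year":  [2020, 2021, 2020, 2021],
    "sales": [10, 12, 7, 9],
})

# pivot: long format -> wide format (one column per year)
wide = df.pivot(index="city", columns="year", values="sales")

# melt: wide format -> long format again
long = wide.reset_index().melt(id_vars="city", var_name="year", value_name="sales")
```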