DATA SCIENCE
TYCS SEMESTER VI UNIT 1 By Asst Prof. Bindy Wilson
Introduction
⚫ Statistical Learning - a field of statistics that places heavy emphasis on computational considerations
⚫ Machine Learning - building computer systems that automatically improve with experience
⚫ Artificial Intelligence - the science and engineering of making intelligent machines, drawing on linguistics, philosophy, psychology, neuroscience, mathematics, computer science and so on
⚫ Data Mining - the science of knowledge discovery using database systems, ML and statistics
⚫ Data Science - the big umbrella that brings everything together, with the potential to draw insight from data and build intelligent systems on top of it

Data Science
⚫ An interdisciplinary field that uses techniques from computer science, databases, statistics & machine learning
⚫ Involves collection, preparation, analysis, visualization, management and preservation of data
⚫ Extracts meaningful information from data sources to be used for business purposes
⚫ Uses that knowledge to
  • Make decisions
  • Predict the future
Types of Data
⚫ Data is a collection of facts in a format that can be processed by a computer.
⚫ Two data types (a short sketch showing these types in code appears at the end of this section):
⚫ Quantitative data (Numerical variables): data that can be described using numbers and basic mathematical procedures. Eg: the salary of employees
⚫ Qualitative data (Categorical variables): data that cannot be described using numbers and basic mathematics. Eg: gender or country name

Categorical or Qualitative data
⚫ Classified into Nominal, Binary, and Ordinal
⚫ Nominal - variables without any regard for ordering. For example, candidate names in polling data from a survey.
⚫ Ordinal - can have two or more categories with the added condition that the categories are ordered. For example, a customer rating for a movie: the rating variable has a relative importance on a scale of 1 to 5
⚫ Binary - variables with exactly two categories, such as gender or the possible outcomes of a single coin toss

Numerical or Quantitative data
⚫ Measurable and represented as numbers, not words or text
⚫ Divided into discrete and continuous
⚫ Discrete variables take a countable set of values, e.g. days in a month. Continuous variables can take any value within a range and are subdivided into Interval and Ratio
⚫ Interval - measured along a continuous range with no true zero; 0 °C still represents a certain degree of temperature
⚫ Ratio - examples include distance, mass, and height. A value of 0 for a ratio variable means none of the quantity is present (a true zero).

Traditional vs Big Data
⚫ Traditional data - structured & stored in databases in table format, contains numeric or text values, usually managed in a single system
⚫ Big data - distributed across a network of computers and characterised by the 5 V's:
⚫ Volume - enormous volume
⚫ Variety - many sources & types: photos, videos, audio, PDF, data from sensors, monitoring devices...
⚫ Velocity - massive & continuous data flow
⚫ Veracity - uncertain, imprecise, abnormal data
⚫ Validity - whether the data is accurate for its intended use

Different types of data sources
1) Structured - always the easiest to understand, represent, store, query, and process
⚫ Data has rows and columns stored in a tabular manner
⚫ Data coming from CSV and Excel files
2) Semi-Structured - web data such as XML, HTML etc.
⚫ Data generated from Twitter and Facebook
⚫ Stored in NoSQL databases like MongoDB and Cassandra
3) Unstructured - data like images, videos, web logs and click streams, as well as non-digitized data from newspapers and books

The Five Steps of Data Science
⚫ 1. Asking an interesting question
⚫ 2. Obtaining the data - finding the right data set
⚫ 3. Exploring the data - understanding the data
⚫ 4. Modeling the data - involves the use of statistical and machine learning models
⚫ 5. Communicating and visualizing the results - conclude your results in a digestible format
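The sketch below, referenced in the Types of Data slide, shows one way the qualitative and quantitative variable types above could be represented in a pandas DataFrame; the column names and values are made up for illustration, and pandas is assumed to be available.

```python
import pandas as pd

# A small illustrative data set (all values are made up)
df = pd.DataFrame({
    "name": ["Asha", "Ravi", "Meera"],        # nominal (qualitative)
    "rating": [3, 5, 4],                       # ordinal (movie rating on a 1-5 scale)
    "is_member": [True, False, True],          # binary
    "salary": [52000.0, 61000.0, 58000.0],     # ratio (quantitative, continuous)
    "days_present": [22, 20, 25],              # discrete (quantitative)
})

# Mark qualitative columns explicitly as categorical;
# 'rating' is ordered, so it becomes an ordinal categorical
df["name"] = df["name"].astype("category")
df["rating"] = pd.Categorical(df["rating"], categories=[1, 2, 3, 4, 5], ordered=True)

print(df.dtypes)
```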
Data Collection
⚫ Primary and Secondary data
⚫ Primary data is data originated for the first time by the researcher through direct effort and experience
⚫ Also known as first-hand or raw data
⚫ Collected through surveys, observations, physical testing, mailed questionnaires, interviews
⚫ Secondary data is second-hand information which has already been collected and recorded by someone else
⚫ Readily available from various sources like censuses, government publications, internal records of the organisation, reports, books, journal articles, websites

Various types of data collection methods
⚫ 1) Companies and Proprietary Data Sources
⚫ 2) Government Data Sources
⚫ 3) Academic Data Sets
⚫ 4) Sweat Equity
⚫ 5) Scraping
Types of Observation
⚫ A) Casual & Scientific
⚫ B) Simple & Systematic
⚫ C) Subjective & Objective
⚫ D) Factual & Inferential
⚫ E) Direct & Indirect
⚫ F) Behavioral & Non-behavioral

Web scraping
⚫ Used for extracting data from websites (a small sketch follows this section)
⚫ Scraping a web page involves fetching it and extracting data from it
⚫ The content of a page may be parsed, searched, reformatted, and its data copied into a spreadsheet
⚫ Human Copy-Paste
⚫ Text pattern matching - using the regular expression facilities of programming languages
⚫ API Interface
⚫ DOM Parsing

Data wrangling or Data cleaning
⚫ Initial Data Analysis (IDA)
⚫ Removing inconsistencies from the data, such as missing values, and bringing it into a standard format
⚫ Correcting factor variables
⚫ Dealing with NAs - imputation, the process of filling in missing values
⚫ Dealing with dates and times
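The following is a rough illustration of the fetch-and-parse style of web scraping described in the Web scraping slide above. It assumes the third-party requests and BeautifulSoup (bs4) libraries are installed; the URL and the h2.title selector are placeholders, not a real site.

```python
import requests
from bs4 import BeautifulSoup

# Placeholder URL -- substitute a real page you are allowed to scrape
url = "https://example.com/articles"

# Fetch the page
response = requests.get(url, timeout=10)
response.raise_for_status()

# DOM parsing: extract text from hypothetical <h2 class="title"> elements
soup = BeautifulSoup(response.text, "html.parser")
titles = [h2.get_text(strip=True) for h2 in soup.find_all("h2", class_="title")]

print(titles)
```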
Handling missing data
⚫ Heuristic-based imputation - make a reasonable guess
⚫ Mean value imputation - use the mean value of the variable
⚫ Random value imputation - select a random value from the column
⚫ Imputation by nearest neighbor - identify the record which matches most closely on all fields, and use this nearest neighbor to infer the values
⚫ Imputation by interpolation - use a method like linear regression to predict the missing value
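A minimal sketch of three of the imputation strategies above using pandas and NumPy (both assumed available); the series values are made up. Nearest-neighbour imputation is not shown here; in practice it could be done with a tool such as scikit-learn's KNNImputer.

```python
import numpy as np
import pandas as pd

# Illustrative column with missing values (NaN); the data is made up
s = pd.Series([12.0, np.nan, 15.0, 11.0, np.nan, 14.0])

# Mean value imputation: replace each NaN with the column mean
mean_filled = s.fillna(s.mean())

# Random value imputation: replace each NaN with a randomly chosen observed value
rng = np.random.default_rng(0)
observed = s.dropna().to_numpy()
random_filled = s.copy()
random_filled[s.isna()] = rng.choice(observed, size=s.isna().sum())

# Imputation by interpolation: estimate missing values from neighbouring points
interpolated = s.interpolate()

print(mean_filled.tolist())
print(random_filled.tolist())
print(interpolated.tolist())
```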
Exploratory Data Analysis (EDA)
⚫ Fundamental step after data collection and pre-processing
⚫ Most EDA techniques are graphical in nature
⚫ Objectives of EDA:
⚫ 1) Maximize insight into the data set
⚫ 2) Visualize relationships between exposure and outcome variables
⚫ 3) Detect outliers and anomalies
⚫ 4) Extract and create relevant variables
⚫ 5) Find a suitable model

EDA methods - graphical or non-graphical
⚫ Non-graphical - summary statistics such as frequency, mean, median, mode, range, interquartile range, and maximum and minimum values
⚫ Graphical - data visualization using multiple types of charts and graphs
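As a small illustration of non-graphical EDA, the sketch below computes summary statistics and a frequency table with pandas (assumed available); the data set is invented for the example.

```python
import pandas as pd

# Illustrative data set (values are made up)
df = pd.DataFrame({
    "age": [23, 31, 45, 29, 38, 51],
    "salary": [30000, 42000, 58000, 39000, 50000, 64000],
    "city": ["Mumbai", "Pune", "Mumbai", "Delhi", "Pune", "Mumbai"],
})

# Non-graphical EDA: summary statistics for the numeric columns
print(df.describe())

# Frequency table for a categorical column
print(df["city"].value_counts())

# Correlation between the numeric variables
print(df[["age", "salary"]].corr())
```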
Summary Statistics
⚫ Mean - the sum of the values divided by the number of observations
⚫ Median - the middle value in a sorted data set
⚫ Variance - a measure of the spread of the given set of numbers
⚫ Interquartile range (IQR) - the range of the data situated between the 1st and the 3rd quartiles
⚫ Skewness - measures asymmetry about the mean

Summary Statistics (contd)
⚫ Kurtosis - a measure of the peakedness and tailedness of the probability distribution of a random variable
⚫ Covariance and correlation - measure the degree of the relationship between two random variables
⚫ If two variables have a correlation close to -1, as one variable increases the other decreases; if they have a correlation close to +1, the variables move together in the same direction
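A short sketch computing the summary statistics listed above with pandas (assumed available); the two series are made-up samples.

```python
import pandas as pd

# Illustrative samples (values are made up)
x = pd.Series([4, 7, 7, 8, 9, 10, 12, 13, 15, 40])
y = pd.Series([5, 6, 8, 8, 10, 11, 12, 14, 16, 35])

print("mean:", x.mean())
print("median:", x.median())
print("variance:", x.var())                          # spread about the mean
print("IQR:", x.quantile(0.75) - x.quantile(0.25))   # middle 50% of the data
print("skewness:", x.skew())                         # asymmetry about the mean
print("kurtosis:", x.kurtosis())                     # peakedness / tailedness
print("covariance:", x.cov(y))
print("correlation:", x.corr(y))                     # close to +1: move together
```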
Data visualization
⚫ The process of creating and studying visual representations of data to bring out meaningful insights
⚫ Deals with visualizing the information in a given data set
⚫ Benefits
  • Identifying red spots (problem areas) in the data
  • Tracking and identifying relations among different attributes
  • Seeing the trend
  • Summarizing complicated long spreadsheets and databases into visual form
  • An easy to use and very impactful way to present information

Data visualization (contd)
⚫ Four types of presentation in data visualization: Comparison, Relationship, Distribution, and Composition
⚫ Comparison is used to see the differences between multiple items at a given point in time. Eg - line chart
⚫ Relationship helps in finding the correlation between two or more variables. Eg - scatter and bubble charts
⚫ Distribution charts like column and line histograms show the spread of data; skewness toward the left or right can be easily spotted
⚫ Composition refers to a stacked chart with multiple components, like a pie chart or stacked column chart

Boxplots
⚫ Boxplots are a compact way of representing the five-number summary, namely the median, the first and third quartiles (25th and 75th percentiles), and the min and max.
⚫ The upper side of the vertical rectangular box represents the third quartile and the lower side the first quartile. The difference between the two is the interquartile range, which contains 50% of the data.
⚫ A line dividing the rectangle represents the median.
⚫ The box also has a line extending on both sides (known as whiskers).
⚫ Points plotted beyond the ends of the whiskers are called outliers.
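A minimal boxplot sketch using matplotlib and NumPy (both assumed available); the two groups of values are randomly generated for illustration.

```python
import matplotlib.pyplot as plt
import numpy as np

# Illustrative data: two groups of made-up observations
rng = np.random.default_rng(1)
group_a = rng.normal(loc=50, scale=10, size=100)
group_b = rng.normal(loc=60, scale=15, size=100)

# Boxplot: box spans Q1 to Q3, the line inside is the median,
# whiskers extend beyond the box, points past them are outliers
plt.boxplot([group_a, group_b])
plt.xticks([1, 2], ["Group A", "Group B"])
plt.ylabel("Value")
plt.title("Five-number summary as a boxplot")
plt.show()
```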
Line Chart
⚫ A line chart is a basic visualization chart type in which information is displayed as a series of data points connected by line segments. Line charts are used for showing trends.
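A minimal line chart sketch with matplotlib (assumed available); the monthly sales figures are invented.

```python
import matplotlib.pyplot as plt

# Illustrative data: made-up monthly sales figures to show a trend
months = ["Jan", "Feb", "Mar", "Apr", "May", "Jun"]
sales = [120, 135, 150, 145, 160, 180]

plt.plot(months, sales, marker="o")  # data points connected by line segments
plt.xlabel("Month")
plt.ylabel("Sales")
plt.title("Line chart showing a trend over time")
plt.show()
```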
Scatterplots
⚫ A scatterplot is a graph that helps identify whether there is a relationship between two variables.
⚫ Scatterplots use Cartesian coordinates to show two variables on an x- and y-axis. Higher-dimensional scatterplots are also possible.
⚫ By adding dimensions such as color, shape, or size, we can present more than two variables.
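A minimal scatterplot sketch with matplotlib and NumPy (assumed available); the height, weight and age values are simulated, with point size used as a third dimension.

```python
import matplotlib.pyplot as plt
import numpy as np

# Illustrative data: two related variables plus a third encoded as point size
rng = np.random.default_rng(2)
height = rng.normal(165, 10, 50)                 # x-axis variable
weight = 0.9 * height + rng.normal(0, 5, 50)     # y-axis variable, related to height
age = rng.integers(18, 60, 50)                   # third variable shown via marker size

plt.scatter(height, weight, s=age, alpha=0.6)    # size adds a third dimension
plt.xlabel("Height (cm)")
plt.ylabel("Weight (kg)")
plt.title("Scatterplot of two variables")
plt.show()
```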
Correlation Plots
⚫ The best way to show how much one indicator relates to another is by computing the correlation.
⚫ In a correlation plot, the combination of color, size, and position encodes a numeric value in a visual representation.
⚫ In the ellipse style of correlation plot, the direction of each ellipse indicates a positive or negative correlation and its size represents the magnitude of the value.
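The ellipse-style plot described above comes from tools such as R's corrplot package; a common stand-in in Python, sketched below with matplotlib and pandas (assumed available), encodes the same correlation matrix as a colour heatmap. The indicator names and data are invented.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Illustrative data: three made-up indicators
rng = np.random.default_rng(3)
df = pd.DataFrame({"income": rng.normal(50, 10, 100)})
df["spending"] = 0.8 * df["income"] + rng.normal(0, 3, 100)   # positively related
df["savings"] = -0.5 * df["income"] + rng.normal(0, 3, 100)   # negatively related

corr = df.corr()  # pairwise correlation matrix

# Colour encodes the strength and sign of each correlation
plt.imshow(corr, vmin=-1, vmax=1, cmap="coolwarm")
plt.colorbar(label="correlation")
plt.xticks(range(len(corr)), corr.columns, rotation=45)
plt.yticks(range(len(corr)), corr.columns)
plt.title("Correlation plot (heatmap form)")
plt.show()
```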
Stacked Column Charts & Histograms
⚫ Stacked column charts are an elegant way of showing the composition of the various categories that make up a particular variable.
⚫ A histogram is one of the most basic and easy to understand graphical representations of numerical data.
⚫ It consists of rectangular bars. The width of each bar covers a certain range of values and the height signifies the number of data points within that range.
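A small sketch of a stacked column chart and a histogram with matplotlib and NumPy (assumed available); the quarterly sales and the sampled values are made up.

```python
import matplotlib.pyplot as plt
import numpy as np

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))

# Stacked column chart: composition of made-up quarterly sales by product
quarters = ["Q1", "Q2", "Q3", "Q4"]
product_a = [20, 25, 30, 35]
product_b = [15, 18, 20, 22]
ax1.bar(quarters, product_a, label="Product A")
ax1.bar(quarters, product_b, bottom=product_a, label="Product B")  # stack on top
ax1.set_title("Stacked column chart")
ax1.legend()

# Histogram: each bar's width is a value range (bin), its height a count
rng = np.random.default_rng(4)
values = rng.normal(50, 10, 500)
ax2.hist(values, bins=20)
ax2.set_title("Histogram")

plt.tight_layout()
plt.show()
```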
Pie Charts, Heatmaps & Dendrograms
⚫ A pie chart is a type of graph that uses pie slices to show the relative sizes of data.
⚫ Heatmaps are visualizations of data where values are represented as different shades of a color; the darker the shade, the higher the value.
⚫ Dendrograms are visual representations specifically useful in clustering analysis. They are tree diagrams frequently used to illustrate the formation of clusters.
⚫ The y-axis in a dendrogram measures the closeness (or similarity) of clusters.
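A small sketch of a pie chart and a dendrogram; it assumes matplotlib, NumPy and SciPy are available, and the market shares and clustered points are invented for illustration.

```python
import matplotlib.pyplot as plt
import numpy as np
from scipy.cluster.hierarchy import dendrogram, linkage

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))

# Pie chart: relative sizes of made-up market shares
ax1.pie([45, 30, 15, 10], labels=["A", "B", "C", "D"], autopct="%1.0f%%")
ax1.set_title("Pie chart")

# Dendrogram: hierarchical clustering of a few made-up 2-D points;
# the y-axis shows how close (similar) clusters are when they merge
rng = np.random.default_rng(5)
points = np.vstack([rng.normal(0, 1, (5, 2)), rng.normal(5, 1, (5, 2))])
dendrogram(linkage(points, method="ward"), ax=ax2)
ax2.set_title("Dendrogram")

plt.tight_layout()
plt.show()
```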
High-level Programming Language
⚫ A high-level language (HLL) is a programming language that enables a programmer to write programs that are independent of a particular type of computer.
⚫ They are closer to human languages and further from machine languages.
⚫ Assembly languages are considered low-level because they are very close to machine languages.
⚫ Advantages of high-level languages
⚫ High-level languages are programmer-friendly. They are easy to write, debug and maintain.
⚫ They provide a higher level of abstraction from machine languages.
⚫ They are machine-independent.
⚫ Easy to learn.
⚫ Less error-prone; errors are easy to find and debug.
⚫ High-level programming results in better programming productivity.
Integrated Development Environment (IDE)
⚫ An IDE enables programmers to consolidate the different aspects of writing a computer program.
⚫ It is software for building applications that combines common developer tools into a single graphical user interface (GUI).
⚫ Development tools include text editors, code libraries, compilers and test platforms.
⚫ An IDE typically offers:
⚫ a text editor,
⚫ automated code validation,
⚫ syntax highlighting,
⚫ auto completion,
⚫ contextual suggestions,
⚫ easy access to help, and
⚫ debugging tools