Module 1 PPT

The document outlines the vision, mission, and educational objectives of the Department of Computer Science and Engineering at SVCE, Bengaluru, emphasizing the importance of program accreditation and stakeholder perception. It details the program outcomes and specific outcomes for graduates, focusing on skills in data science, data analysis, and visualization. Additionally, it provides an overview of course modules, learning resources, and the significance of data science in various fields, including healthcare and business.

Uploaded by

Divyaraj


DATA SCIENCE AND VISUALIZATION
18CS644
Why NBA?
• Program Accreditation
• Washington Accord
• Branding & Bragging
• Stakeholder Perception
• No escape from statutory and regulatory bodies such as
AICTE, MCI, etc., irrespective of institutional rankings
or status
• Required for AICTE Approval, Extension of Approval,
New programs and seat increase

28/04/2025 Department of CSE, SVCE, Bengaluru - 562157 2

Institute Vision and Mission
Our Vision
• To be a premier institute for addressing the challenges in global
perspective.
Our Mission
• M1: Nurture students with professional and ethical outlook to
identify needs, analyze, design and innovate sustainable solutions
through lifelong learning in service of society as individual or a
team.
• M2: Establish state-of-the-art Laboratories and Information
Resource centre for education and research.
• M3: Collaborate with Industry, Government Organization and
Society to align the curriculum and outreach activities.
Department Vision and Mission
Our Vision
• To be a school of Excellence in Computing for Holistic Education and Research
Our Mission
• M1: Accomplish academic achievement in Computer Science and Engineering
through student-centered creative teaching learning, qualified faculty members,
assessment and effective usage of ICT.
• M2: Establish a Center of Excellence in various verticals of computer science
and engineering to encourage collaborative research and industry-institute
interaction.
• M3: Transform engineering students into socially responsible, ethical,
technically competent and value-added professionals or entrepreneurs through
holistic education.
Program Educational Objectives
• Knowledge: Computer Science and Engineering Graduates will have
professional technical careers in interdisciplinary domains providing
innovative and sustainable solutions using modern tools.
• Skills: Computer Science and Engineering Graduates will have effective
communication, leadership, team building, problem solving, decision making
and creative skills.
• Attitude: Computer Science and Engineering Graduates will practice ethical
responsibilities towards their peers, employers and society.


Program Specific Outcomes
• PSO 1: Ability to adapt quickly to any domain, interact with
diverse groups of individuals and be an entrepreneur in a
societal and global setting.

• PSO 2: Ability to visualize the operations of existing and
future software applications.


Program Outcomes
• Engineering knowledge: Apply the knowledge of mathematics, science,
engineering fundamentals, and an engineering specialization to the solution of
complex engineering problems.
• Problem analysis: Identify, formulate, review research literature, and analyze
complex engineering problems reaching substantiated conclusions using first
principles of mathematics, natural sciences, and engineering sciences.
• Design/development of solutions: Design solutions for complex engineering
problems and design system components or processes that meet the specified
needs with appropriate consideration for the public health and safety, and the
cultural, societal, and environmental considerations.


Program Outcomes
• Conduct investigations of complex problems: Use research-based
knowledge and research methods including design of experiments,
analysis and interpretation of data, and synthesis of the information to
provide valid conclusions.
• Modern tool usage: Create, select, and apply appropriate techniques,
resources, and modern engineering and IT tools including prediction and
modeling to complex engineering activities with an understanding of the
limitations.
• The engineer and society: Apply reasoning informed by the contextual
knowledge to assess societal, health, safety, legal and cultural issues and
the consequent responsibilities relevant to the professional engineering
practice.


Program Outcomes
• Environment and sustainability: Understand the impact of the
professional engineering solutions in societal and environmental
contexts, and demonstrate the knowledge of, and need for
sustainable development.
• Ethics: Apply ethical principles and commit to professional ethics
and responsibilities and norms of the engineering practice.
• Individual and team work: Function effectively as an individual, and
as a member or leader in diverse teams, and in multidisciplinary
settings.


Program Outcomes
• Communication: Communicate effectively on complex
engineering activities with the engineering community and with
society at large, such as, being able to comprehend and write
effective reports and design documentation, make effective
presentations, and give and receive clear instructions.
• Project management and finance: Demonstrate knowledge and
understanding of the engineering and management principles and
apply these to one’s own work, as a member and leader in a team,
to manage projects and in multidisciplinary environments.
• Life-long learning: Recognize the need for, and have the
preparation and ability to engage in independent and life-long
learning in the broadest context of technological change.
Course Outcomes
At the end of the course the student will be able to:
CO 1. Understand the data in different forms
CO 2. Apply different techniques to Explore Data Analysis and the
Data Science Process.
CO 3. Analyze feature selection algorithms & design a recommender
system.
CO 4. Evaluate data visualization tools and libraries and plot graphs.
CO 5. Develop different charts and include mathematical expressions.
Syllabus
Module-1
Introduction to Data Science
Introduction: What is Data Science? Big Data and Data Science hype – and
getting past the hype, Why now? – Datafication, Current landscape of
perspectives, Skill sets needed. Statistical Inference: Populations and samples,
Statistical modelling, probability distributions, fitting a model.
Module-2
Exploratory Data Analysis and the Data Science Process
Basic tools (plots, graphs and summary statistics) of EDA, Philosophy of EDA,
The Data Science Process, Case Study: RealDirect (online real estate firm).
Three Basic Machine Learning Algorithms: Linear Regression, k-Nearest
Neighbours (k-NN), k-means.
Module-3
Feature Generation and Feature Selection
Extracting Meaning from Data: Motivating application: user (customer) retention. Feature
Generation (brainstorming, role of domain expertise, and place for imagination), Feature
Selection algorithms. Filters; Wrappers; Decision Trees; Random Forests.
Recommendation Systems: Building a User-Facing Data Product, Algorithmic ingredients
of a Recommendation Engine, Dimensionality Reduction, Singular Value Decomposition,
Principal Component Analysis, Exercise: build your own recommendation system.
Module-4
Data Visualization and Data Exploration
Introduction: Data Visualization, Importance of Data Visualization, Data Wrangling, Tools
and Libraries for Visualization; Comparison Plots: Line Chart, Bar Chart and Radar Chart;
Relation Plots: Scatter Plot, Bubble Plot, Correlogram and Heatmap; Composition Plots:
Pie Chart, Stacked Bar Chart, Stacked Area Chart, Venn Diagram; Distribution Plots:
Histogram, Density Plot, Box Plot, Violin Plot; Geo Plots: Dot Map, Choropleth Map,
Connection Map; What Makes a Good Visualization?
Module-5
A Deep Dive into Matplotlib
Introduction, Overview of Plots in Matplotlib, Pyplot Basics: Creating
Figures, Closing Figures, Format Strings, Plotting, Plotting Using
pandas DataFrames, Displaying Figures, Saving Figures; Basic Text and
Legend Functions: Labels, Titles, Text, Annotations, Legends; Basic
Plots: Bar Chart, Pie Chart, Stacked Bar Chart, Stacked Area Chart,
Histogram, Box Plot, Scatter Plot, Bubble Plot; Layouts: Subplots, Tight
Layout, Radar Charts, GridSpec; Images: Basic Image Operations,
Writing Mathematical Expressions
Suggested Learning Resources
Textbooks

1. Doing Data Science, Cathy O'Neil and Rachel Schutt, O'Reilly Media, Inc., 2013

2. The Data Visualization Workshop, Tim Großmann and Mario Döbler, Packt Publishing, ISBN 9781800568112

Reference:

1. Mining of Massive Datasets, Anand Rajaraman and Jeffrey D. Ullman, Cambridge University Press, 2010

2. Data Science from Scratch, Joel Grus, Shroff Publishers / O'Reilly Media
Module-1
Introduction to Data Science
• Data science is like being a detective, but instead of solving
crimes, you're solving puzzles hidden in data.

• Imagine you have a massive collection of puzzle pieces
scattered all over the place. Each piece represents a different
aspect of information, like numbers, words, or images.
Key Components of Data Science:
• Data Collection:
• Just like a detective gathers clues, data scientists collect different types of
information from various sources. This could be anything from customer
reviews, sensor readings, social media posts, or even weather data.

• Data Cleaning and Preparation:

• Once we have our puzzle pieces, we need to clean and organize them. Sometimes
pieces are missing or don't fit quite right, so we have to tidy up the data to
make sense of it. Think of this as sorting through the puzzle pieces, discarding the
ones that don't belong, and making sure everything is in the right order.
• Data Analysis:
• Now that we have our clean and organized data, it's time to analyze it. This
is where we start putting the puzzle together. We look for patterns, trends,
and insights that can help us understand the story behind the data.

• Data Visualization:
• Data visualization is like creating a picture of the solved puzzle. Instead of
just looking at rows and columns of numbers, we use charts, graphs, and
other visual tools to make the information easier to understand and interpret.
Examples:

• Netflix Recommendations:
• When Netflix suggests a movie or show you might like based on
what you've watched before, that's data science in action. It's
analyzing your viewing history (data), finding patterns, and
making predictions about what you might enjoy next.
• Predicting Weather:
• Weather forecasting relies on collecting and analyzing
data from various sources like satellites, weather
stations, and sensors. By crunching this data,
meteorologists can predict things like temperature,
precipitation, and storms.
• Healthcare Analytics:
• Hospitals use data science to improve patient care
and outcomes. For example, analyzing patient
records can help identify trends in diseases, predict
potential outbreaks, or personalize treatment plans.
What is Data & Science?
• The term “data science” combines two key elements: “data” and
“science.”
Data:
• It refers to the raw information that is collected, stored, and
processed.
• In today’s digital age, enormous amounts of data are generated
from various sources such as sensors, social media, transactions,
and more.
• This data can come in structured formats (e.g., databases) or
unstructured formats (e.g., text, images, videos).
Science:

• It refers to the systematic study and investigation of
phenomena using scientific methods and principles.

• Science involves forming hypotheses, conducting
experiments, analyzing data, and drawing conclusions
based on evidence.
What is the difference between DS and ML?

• Data science provides the framework and tools for
extracting insights from data, while machine
learning is a subset of data science that focuses on
developing algorithms for automated learning and
prediction.
Big Data and Data Science Hype
This topic covers some common concerns and
misconceptions surrounding data science,
especially regarding "Big Data" and its relationship
with traditional research fields like statistics.
• Lack of Clear Definitions: People often use terms like "Big Data" and "data
science" without clear definitions, making them seem meaningless or confusing.

• Lack of Respect for Traditional Researchers: There's a disregard for the decades
of work done by researchers in various fields like statistics, computer science,
and engineering, who laid the groundwork for modern data science.

• Excessive Hype: There's an exaggeration and hype around data science, leading to
unrealistic expectations and making it harder to see its real value.

• Overlap with Statistics: Data science is sometimes seen as just a rebranding of
statistics, which can be frustrating for statisticians who feel their field is being
• In simpler terms, the passage is saying that people often
talk about data science without really understanding what
it means, they don't appreciate the work that went into it
before, they hype it up too much, and they overlook the
connection with traditional fields like statistics.
Getting Past the Hype
• Rachel's Experience: Summary of Rachel's experience transitioning
from studying statistics to working at Google.
• Quote from Rachel: "It was clear to me pretty quickly that the stuff I
was working on at Google was different than anything I had learned at
school when I got my PhD in statistics."
• Rachel's investigation into data science through meetings and teaching
a course at Columbia aimed to clarify the emerging field's meaning
and significance.
• Ultimately, the goal of the book is to help more people understand
the reality of data science.
Datafication

• Datafication refers to the process of converting various
aspects of human life and activities into digital data that can
be stored, analyzed, and utilized for various purposes.

• This includes transforming behaviors, interactions, and
transactions, both online and offline, into quantifiable data
points.
• Datafication enables the collection, processing, and
interpretation of vast amounts of information, often with
the aim of gaining insights, making predictions, and
driving decision-making in diverse fields such as business,
healthcare, education, and governance.
Example
• For instance, when a user "likes" a post on Facebook, this action is recorded as
data, contributing to the user's profile and providing insights into their
preferences and interests.

• Over time, as users continue to engage with the platform, their data
profiles become increasingly detailed and comprehensive.

• This data can then be analyzed and used by the platform to personalize user
experiences, recommend content, target advertisements, and optimize
engagement.
The Current Landscape (with a Little History)

What is data science? Is it new, or is it just
statistics or analytics rebranded? Is it real, or is it
pure hype? And if it's new and if it's real, what
does that mean?
What is Data Science?
• Data science is using data to answer questions.
• Data science is the science of analyzing raw data using
statistics and machine learning techniques with the
purpose of drawing conclusions about that
information.
Drew Conway's Venn diagram of data science (2010)
Data Science involves:
• Statistics, computer science,
mathematics
• Data cleaning and
formatting
• Data visualization
• Hacking skills refer to the ability to manipulate data efficiently
using programming languages and tools.
• Mathematical and statistical knowledge is essential for analyzing
and deriving insights from the data collected.
• Substantive expertise involves having a deep understanding of
the domain or field in which the data is being analyzed.
• These three components intersect to form the core of data
science, where successful practitioners can collect, clean,
analyze, and interpret data effectively.
• However, there is a danger zone where individuals may possess
hacking skills and substantive expertise but lack
understanding of mathematical and statistical concepts,
potentially leading to misleading or misinterpreted analysis
results.
Data Science Jobs
• The paragraph highlights the growing demand for data scientists, with Columbia
University establishing an Institute for Data Sciences and Engineering and numerous
job openings in New York City alone.

• It acknowledges that data science encompasses a wide range of skills, including
computer science, statistics, communication, data visualization, and domain
expertise.

• However, it emphasizes that it's unrealistic for one individual to be an expert in all these
areas, suggesting that building teams with diverse skill sets is more effective.

• This approach allows teams to specialize in different aspects of data science and
Skill Sets
• It is important to develop a diverse skill set as a data scientist,
particularly focusing on statistical thinking in the age of Big Data.

• It emphasizes that while foundational knowledge in statistics, linear
algebra, and programming is crucial, data scientists also need to
develop parallel skill sets in data preparation, modeling, coding,
visualization, and communication.

• These skills are interdependent and essential for effectively working
with data.
Statistical Thinking

Example: Analyzing Students' Performance in AI and ML Course.

(Only for your understanding; don't write this in the exam)

• Imagine I want to analyze how well the 5th-semester
students performed in the AI and ML course, in order to improve my
current Data Science and Visualization teaching strategies.
Here's how I can apply statistical thinking:
1. Data Collection
2. Descriptive Statistics
3. Data Visualization
4. Inferential Statistics
5. Correlation Analysis
6. Making Conclusions
7. Improving Teaching Strategies
8. Feedback Loop
Data Collection

• Gather the final grades of all students from the AI
and ML course last semester.
Descriptive Statistics
• Mean: Calculate the average grade to understand the overall performance level
of the class.

• Median: Identify the middle grade to get a sense of the central tendency without
extreme grades skewing the result.

• Mode: Determine the most frequently occurring grade, which can indicate a
common performance level.

• Range: Find the difference between the highest and lowest grades to understand
the spread of student performance.
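The four measures above can be sketched in a few lines of Python using only the standard library. The grade list is hypothetical, invented purely for illustration:

```python
import statistics

# Hypothetical final grades (out of 100); not real student data.
grades = [55, 62, 62, 70, 71, 74, 78, 78, 78, 85, 91, 96]

mean_grade = statistics.mean(grades)      # average performance of the class
median_grade = statistics.median(grades)  # middle value, robust to extreme grades
mode_grade = statistics.mode(grades)      # most frequently occurring grade
grade_range = max(grades) - min(grades)   # spread between highest and lowest

print(f"mean={mean_grade} median={median_grade} "
      f"mode={mode_grade} range={grade_range}")
```

With this particular list the mean is 75, the median 76, the mode 78, and the range 41.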
Data Visualization

• Histogram: Create a histogram of the grades to visualize the
distribution. This can show how grades are spread across
different ranges.

• Box Plot: Use a box plot to display the median, quartiles, and
any potential outliers in the grades.
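The numbers a box plot draws (median, quartiles, outliers) can be computed directly. This is a minimal sketch on hypothetical grades; the 1.5 × IQR outlier rule used here is one common convention and an assumption, since the slides do not name a rule:

```python
import statistics

# Hypothetical grades with one unusually low score; for illustration only.
grades = [22, 55, 62, 62, 70, 71, 74, 78, 78, 78, 85, 91, 96]

# Quartiles: q1 (25%), q2 (median), q3 (75%).
q1, q2, q3 = statistics.quantiles(grades, n=4, method="inclusive")

# Interquartile range and the usual 1.5 * IQR "fences" (assumed convention).
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Grades outside the fences are what a box plot would mark as outliers.
outliers = [g for g in grades if g < lower_fence or g > upper_fence]

print(f"Q1={q1} median={q2} Q3={q3} IQR={iqr} outliers={outliers}")
```

A plotting library would draw the same quantities; computing them first makes it clear what the picture encodes.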
Inferential Statistics
• Standard Deviation: Calculate the standard deviation of the grades to
understand the variability. A low standard deviation means grades are close
to the mean, while a high standard deviation indicates more variation.

• Z-scores: Convert grades to z-scores to see how many standard deviations
each grade is from the mean, helping identify significantly high or low
performers.

• Note: On their own these are descriptive measures; they support inference
when used to generalize from this class to other groups of students.
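As a rough illustration, standard deviation and z-scores can be computed as follows. The grades and the |z| > 1.5 cut-off are assumptions chosen for the example, not values from the course:

```python
import statistics

# Hypothetical grades; for illustration only.
grades = [55, 62, 62, 70, 71, 74, 78, 78, 78, 85, 91, 96]

mean_grade = statistics.mean(grades)
# Population standard deviation: here the class is treated as the whole group.
std_grade = statistics.pstdev(grades)

# z-score = how many standard deviations a grade lies from the mean.
z_scores = [(g - mean_grade) / std_grade for g in grades]

# Flag unusually high or low performers (threshold 1.5 is an assumed choice).
flagged = [g for g, z in zip(grades, z_scores) if abs(z) > 1.5]

print(f"std={std_grade:.2f} flagged={flagged}")
```

If the grades were a sample from a larger group, `statistics.stdev` (which divides by n − 1) would be the more usual choice.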
Correlation Analysis

• Correlation with Attendance: Analyze if there is a correlation between
students' attendance and their grades. This can help you understand if
regular attendance is a significant factor in student performance.

• Correlation with Assignments: Check the correlation between students'
performance in assignments and their final grades. This can indicate if
consistent performance throughout the course impacts the final result.
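One common way to quantify such a relationship is the Pearson correlation coefficient. The sketch below implements it by hand on made-up attendance and grade figures; the helper name `pearson_r` and the data are hypothetical:

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Co-variation of x and y around their means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Variation of each variable around its own mean.
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical data: attendance (%) and final grade for six students.
attendance = [60, 70, 75, 80, 90, 95]
final_grades = [52, 61, 70, 68, 84, 88]

r = pearson_r(attendance, final_grades)
print(f"r = {r:.3f}")  # close to +1 means a strong positive linear relationship
```

A high value of r shows that the two quantities move together; it does not by itself show that attendance causes better grades.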
Making Conclusions
• Identify trends, such as whether most students scored within a
certain range or if there are outliers who performed exceptionally
well or poorly.

• Determine if there are specific topics where students struggled,
based on clustering of lower grades around certain assignments
or exam questions.
Improving Teaching Strategies:

• Based on your analysis, decide if you need to adjust your
teaching methods, such as providing additional resources or
focusing more on difficult topics.

• Offer additional support or office hours for students who are
identified as outliers with low performance to help them catch
up.
Feedback Loop:
• Use the insights gained to give personalized feedback to students,
highlighting their strengths and areas for improvement.

• Incorporate findings into your current Data Science and
Visualization course to enhance teaching effectiveness and
student learning outcomes.
Statistical thinking
• Statistical thinking is the process of using data to understand the world,
make decisions, and solve problems.

• It involves:

1. Collecting Data: Gathering information from various sources.

2. Analyzing Data: Looking at the numbers to find patterns and trends.

3. Interpreting Data: Figuring out what the patterns and trends mean.

4. Making Decisions: Using the insights from the data to make informed choices.
Note:
• Statistical thinking is a way of understanding a
complex world by describing it in relatively
simple terms that nonetheless capture essential
aspects of its structure or function, and that also
provide us some idea of how uncertain we are
about that knowledge.
Statistical Inference

• Statistical inference is the process of drawing
conclusions or making predictions about a
population based on data collected from a
sample of that population.
In Layman Terms

• Statistical inference is a way of making educated
guesses about a large group based on a smaller
sample of data from that group.
Populations and Samples

• In statistics, understanding the difference
between a population and a sample is
fundamental to many aspects of data analysis
and inference.
Population Vs Sample
Population

• In statistics, the population is the entire set of items
from which data is drawn in the statistical study.

• It can be a group of individuals or a set of items.

• The population size is usually denoted by N.

Sample
• A sample is a subset of the population selected for study.

• It is a representative portion of the population from
which we collect data in order to make inferences or
draw conclusions about the entire population.

• The sample size is denoted by n.
Population Vs Sample
Population | Sample
The population includes all members of a specified group. | A sample is a subset of the population.
Collecting data from an entire population can be time-consuming, expensive, and sometimes impractical or impossible. | Samples offer a more feasible approach to studying populations, allowing researchers to draw conclusions based on smaller, manageable datasets.
Includes all residents in the city. | Consists of 1000 households, a subset of the entire population.
Populations and Samples of Big Data
Populations

• Definition: A population is the entire group of items, individuals, or events
that you're interested in studying.

• Example: Imagine you want to understand the habits of all the users on a
social media platform. Here, all the users of the platform are your population.

• Big Data Context: In big data, a population can be extremely large. For
example, it could include all the tweets made on Twitter, all the transactions
made on Amazon, or all the rides taken using a ride-sharing app.
Samples
• Definition: A sample is a smaller group selected from the population. It's like
taking a small piece to understand the whole.

• Example: Instead of analyzing all the tweets ever made (which is the population),
you might look at a random selection of 10,000 tweets (this is your sample).

• Big Data Context: Even though big data technology can handle massive amounts
of information, analyzing a sample can still be useful. It allows you to make
inferences about the entire population without needing to process all the data,
which can be time-consuming and expensive.
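The idea of estimating a population quantity from a random sample can be sketched as follows. The "population" here is synthetic (random tweet lengths generated for the example), so the numbers are assumptions rather than real Twitter data:

```python
import random
import statistics

random.seed(42)  # fixed seed so the sketch is reproducible

# Hypothetical population: lengths (in characters) of 100,000 tweets.
population = [random.randint(1, 280) for _ in range(100_000)]

# Simple random sample of 10,000 tweets, drawn without replacement.
sample = random.sample(population, 10_000)

population_mean = statistics.fmean(population)  # normally unknown in practice
sample_mean = statistics.fmean(sample)          # our estimate of it

# Standard error: a rough measure of how far the estimate may be off.
standard_error = statistics.stdev(sample) / len(sample) ** 0.5

print(f"population mean = {population_mean:.2f}")
print(f"sample mean     = {sample_mean:.2f} (standard error {standard_error:.2f})")
```

The sample mean lands close to the population mean while touching only a tenth of the data, which is the practical appeal of sampling.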
Summary
• Even though we can record all data, we still use samples to make data handling
easier and to draw accurate conclusions.

• Big companies like Google often use samples because it's more efficient.

• The data we collect might not always represent the whole picture.

• For example, tweets during Hurricane Sandy (Hurricane Sandy was a powerful and
destructive storm that affected the Caribbean and the eastern United States in late
October 2012. It was one of the most significant hurricanes in recent history due to its
size, impact, and the damage it caused) mostly came from New Yorkers, giving a
skewed view of the event.
• Careful sampling helps limit bias and makes data more manageable. Also,
even if we have all data from a company, it's still just one version of
what could have happened, and we use it to understand the bigger
picture of behaviors or trends.

• It's essential to remember that any conclusions drawn from samples
might not apply to the entire population without further investigation
and understanding.
Statistical Thinking vs Statistical Inference
Statistical Thinking

• Definition: Statistical thinking is the overall mindset and approach to
understanding, analyzing, and interpreting data to make informed decisions.

Statistical Inference

• Definition: Statistical inference is a specific aspect of statistical thinking that
involves making predictions or generalizations about a larger population
based on a sample of data.
Important Note on SAMPLE

• Selecting a good sample is important because it
helps ensure your research results are accurate and
truly represent the whole group you're studying.
Modelling

• Modeling refers to the process of creating simplified
representations of complex systems or phenomena to
aid understanding, prediction, or decision-making.
Model
• In simple terms, a model in data science is like a tool or a recipe.

• Imagine you're trying to bake a cake. You follow a recipe that tells you
what ingredients to use and how to mix them together.

• In data science, a model is similar: it's a set of rules or equations
that we use to understand or predict things based on data.
What is model?

• A model is our attempt to understand and represent the
nature of reality through a particular lens, be it
architectural, biological, or mathematical.

• A model is an artificial construction where all extraneous
detail has been removed or abstracted.
Statistical modelling

• Before you get too involved with the data and start coding,
it's useful to draw a picture of what you think the underlying
process might be with your model.

• Statistical modeling is like making educated guesses about
the world around us using math and data.
• In mathematical expressions, the convention is to use Greek
letters for parameters and Latin letters for data. So, for
example, if you have two columns of data, x and y, and you
think there's a linear relationship, you'd write down
y = β0 + β1x.

• You don't know what β0 and β1 are in terms of actual
numbers yet, so they're the parameters.
• Other people prefer pictures and will first draw a diagram
of data flow, possibly with arrows, showing how things
affect other things or what happens over time.

• This gives them an abstract picture of the relationships before
choosing equations to express them.
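For the linear relationship y = β0 + β1x, one standard way to turn the unknown parameters into actual numbers is ordinary least squares. The slides do not name an estimation method, so this choice, the helper name `fit_line`, and the data points are all assumptions made for illustration:

```python
def fit_line(xs, ys):
    """Least-squares estimates (beta0, beta1) for the model y = beta0 + beta1 * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope: co-variation of x and y divided by the variation of x.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    beta1 = sxy / sxx
    # Intercept: forces the fitted line through the point of means.
    beta0 = mean_y - beta1 * mean_x
    return beta0, beta1

# Hypothetical data: two columns, x and y, that look roughly linear.
xs = [1, 2, 3, 4, 5]
ys = [3.1, 4.9, 7.2, 8.8, 11.0]

beta0, beta1 = fit_line(xs, ys)
print(f"y = {beta0:.2f} + {beta1:.2f} * x")
```

Before fitting, β0 and β1 are just symbols; after fitting, they are concrete numbers estimated from the data.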
How do you build a model?

• One place to start is exploratory data analysis
(EDA), which we will cover in a later section.

• This entails making plots and building intuition for
your particular dataset. EDA helps out a lot, as do
trial and error and iteration.
Data Science Modelling Steps
1. Define Your Objective
2. Collect Data
3. Clean Your Data
4. Explore Your Data
5. Split Your Data
6. Choose a Model
7. Train Your Model
8. Evaluate Your Model
9. Improve Your Model
10. Deploy Your Model
Define Your Objective
• First, define very clearly what problem you are going to solve.

• Whether it is predicting customer churn (which customers are likely to stop
using a company's products or services within a given period), making
better product recommendations, or finding patterns in data, you first need to
know your direction.

• This should bring clarity to the choice of data, algorithms, and evaluation
metrics.
Collect Data
• Gather data relevant to your objective.

• This can include internal data from your company,
publicly available datasets, or data purchased from
external sources.

• Ensure you have enough data to train your model
effectively.
Clean Your Data
• Data cleaning is a critical step to prepare your dataset for
modeling.

• It involves handling missing values, removing duplicates, and
correcting errors.

• Clean data improves the reliability of your model’s predictions.
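A minimal sketch of these cleaning steps on a tiny, made-up table of records. The field names, the valid range 0 to 100, and the decision to drop (rather than repair or impute) bad rows are all assumptions for the example:

```python
# Hypothetical raw records as they might arrive from a spreadsheet export.
raw_records = [
    {"student": "A01", "grade": "78"},
    {"student": "A02", "grade": ""},     # missing value
    {"student": "A01", "grade": "78"},   # duplicate of the first row
    {"student": "A03", "grade": "105"},  # error: outside the valid range
    {"student": "A04", "grade": "64"},
]

def clean(records):
    """Drop duplicates, missing values, and out-of-range grades."""
    seen = set()
    cleaned = []
    for rec in records:
        student = rec["student"]
        if student in seen:
            continue  # duplicate: keep only the first occurrence
        try:
            grade = int(rec["grade"])
        except ValueError:
            continue  # missing or non-numeric value
        if not 0 <= grade <= 100:
            continue  # obvious data-entry error
        seen.add(student)
        cleaned.append({"student": student, "grade": grade})
    return cleaned

cleaned_records = clean(raw_records)
print(cleaned_records)
```

Dropping rows is the simplest policy; in real projects missing values are often filled in (imputed) instead, depending on how much data can be spared.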

Explore Your Data

• Data exploration, or exploratory data analysis
(EDA), involves summarizing the main characteristics of
your dataset.

• Use visualizations and statistics to uncover patterns,
anomalies, and relationships between variables.
Split Your Data

• Divide your dataset into training and testing sets.

• The training set is used to train your model, while the testing
set evaluates its performance.

• A common split ratio is 80% for training and 20% for testing.
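The 80/20 split can be sketched with the standard library alone. The helper `split_train_test` is a hypothetical name written for this example; libraries such as scikit-learn provide their own ready-made equivalents:

```python
import random

def split_train_test(rows, test_fraction=0.2, seed=0):
    """Shuffle a copy of rows and split it into (train, test) lists."""
    rng = random.Random(seed)  # local generator so the split is reproducible
    shuffled = rows[:]         # copy: leave the caller's list untouched
    rng.shuffle(shuffled)      # shuffle first so the split is random
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

# Hypothetical dataset: 100 row identifiers.
data = list(range(100))
train_set, test_set = split_train_test(data)

print(len(train_set), len(test_set))
```

Shuffling before splitting matters: if the rows are sorted (for example by date or by grade), taking the last 20% unshuffled would give a test set that does not resemble the training data.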
Choose a Model

• Select a model that suits your problem type (e.g., regression,
classification) and data.

• Beginners can start with simpler models like linear
regression or decision trees before moving on to more
complex models like neural networks.
Train Your Model

• Feed your training data into the model.

• This process involves the model learning from the data,
adjusting its parameters to minimize errors.

• Training a model can take time, especially with large datasets
or complex models.
Evaluate Your Model

• After training, assess your model’s performance using the testing
set.

• For classification, common evaluation metrics include accuracy,
precision, recall, and F1 score.

• Evaluation helps you understand how well your model will
perform on unseen data.
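For a binary classifier, these four metrics follow directly from counting true/false positives and negatives. The sketch below uses made-up labels and predictions; the function name `classification_metrics` is hypothetical:

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Accuracy, precision, recall and F1 for binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)

    accuracy = (tp + tn) / len(pairs)               # share of correct predictions
    precision = tp / (tp + fp) if tp + fp else 0.0  # how many predicted positives are real
    recall = tp / (tp + fn) if tp + fn else 0.0     # how many real positives were found
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)           # harmonic mean of the two
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

# Hypothetical test-set labels and model predictions (1 = positive class).
y_true = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0, 1, 0]

metrics = classification_metrics(y_true, y_pred)
print(metrics)
```

In this example there are 4 true positives, 4 true negatives, 1 false positive and 1 false negative, so all four metrics come out at 0.8.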
Improve Your Model

• Based on the evaluation, you may need to refine your model.

• This can involve tuning hyperparameters, choosing a different
model, or going back to data cleaning and preparation for
further improvements.
Deploy Your Model

• Once satisfied with your model’s performance, deploy
it for real-world use.

• This could mean integrating it into an application or
using it for decision-making within your organization.
Probability distributions
• Imagine you are a data analyst, or someone building machine learning
models, algorithms, or Python scripts, and you need to
analyze trends.

• However, you don’t have enough data to analyze the trend in your
dataset.

• In this class, let’s find a way to solve this problem using
probability distributions.
What is Probability Distribution?
• A probability distribution is a mathematical function that defines the
likelihood of different outcomes or values of a variable.

• This function is commonly represented by a graph or probability table,
and it provides the probabilities of various possible results of an experiment
or random phenomenon based on the sample space and the probabilities of
events.

• Probability distributions are fundamental in probability theory and
statistics for analyzing data and making predictions.
In Layman Terms

• A probability distribution is a way to show how likely different outcomes are.

• Imagine you have a list of all possible outcomes of something random, like rolling a die or picking a card from a deck.

• The probability distribution tells you how often you can expect each outcome.
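For the die example, the probability distribution can be written as a table and compared against simulated rolls. This is a small sketch assuming a fair six-sided die; the seed and the number of rolls are arbitrary choices.

```python
import random
from collections import Counter
from fractions import Fraction

faces = [1, 2, 3, 4, 5, 6]

# Theoretical distribution of a fair die: every face has probability 1/6.
theoretical = {face: Fraction(1, 6) for face in faces}

# Empirical distribution: relative frequency of each face in simulated rolls.
rng = random.Random(0)
rolls = [rng.choice(faces) for _ in range(6000)]
counts = Counter(rolls)
empirical = {face: counts[face] / len(rolls) for face in faces}

for face in faces:
    print(face, float(theoretical[face]), round(empirical[face], 3))
```

The probabilities in each table add up to 1, and with more rolls the empirical frequencies tend to move closer to 1/6.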
Example of Probability Distribution

• Suppose I am a teacher at a university. After checking assignments for a week, I graded all the students. I gave the graded papers to a data entry operator at the university and asked him to create a spreadsheet containing the grades of all the students. But he stored only the grades and not the corresponding students. He made another blunder: in a hurry, he missed a few entries, and we have no idea whose grades are missing.
How to find the missing values?

• One way to find this out is by visualizing the grades and seeing if you can find a trend in the data.

• The graph you plot is called the frequency distribution of the data.

• You see that there is a smooth curve-like structure that defines our data, but do you notice an anomaly? We have an abnormally low frequency at a particular score range.

• So the best guess is that the missing values are the ones that would remove the dent in the distribution.
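The idea of spotting a dent in a frequency distribution can be sketched in a few lines. The grades below are hypothetical stand-ins for the spreadsheet column (the slides do not give the actual data), and the "deficit" rule, which compares each bin with the average of its two neighbours, is just one simple heuristic.

```python
from collections import Counter

# Hypothetical grades, generated from made-up counts per 10-point range.
bin_counts = {30: 2, 40: 5, 50: 9, 60: 4, 70: 10, 80: 5, 90: 2}
grades = [start + 5 for start, n in bin_counts.items() for _ in range(n)]

# Frequency distribution: how many grades fall in each 10-point range.
frequency = Counter((g // 10) * 10 for g in grades)
bins = sorted(frequency)
for start in bins:
    print(f"{start}-{start + 9}: {'#' * frequency[start]}")

def deficit(i):
    """How far bin i falls below the average of its two neighbours."""
    neighbours = (frequency[bins[i - 1]] + frequency[bins[i + 1]]) / 2
    return neighbours - frequency[bins[i]]

# The interior bin with the largest deficit is the likely "dent".
dent_bin = bins[max(range(1, len(bins) - 1), key=deficit)]
print("Likely range with missing grades:", dent_bin, "to", dent_bin + 9)
```

With these made-up counts the 60 to 69 range stands out, so the missing entries would most plausibly belong there.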
The normal distribution is written as

$$p(x \mid \mu, \sigma) = \frac{1}{\sigma\sqrt{2\pi}} \, e^{-\frac{(x-\mu)^2}{2\sigma^2}}$$

The parameter μ is the mean and median and controls where the distribution is centered (because this is a symmetric distribution), and the parameter σ controls how spread out the distribution is. This is the general functional form.
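The roles of μ and σ can be checked numerically with the standard normal density formula. In this sketch the values μ = 70 and σ = 10 are hypothetical, picked only to resemble a grade scale.

```python
import math

def normal_pdf(x, mu, sigma):
    """Density of the normal distribution with mean mu and std. dev. sigma."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))

mu, sigma = 70.0, 10.0               # hypothetical parameter values

# The density is symmetric around mu and highest at mu.
left = normal_pdf(mu - 5, mu, sigma)
right = normal_pdf(mu + 5, mu, sigma)
peak = normal_pdf(mu, mu, sigma)

# A Riemann sum over mu +/- 6 sigma should be very close to 1.
n_steps = 12000
step = 12 * sigma / n_steps
area = sum(normal_pdf(mu - 6 * sigma + i * step, mu, sigma)
           for i in range(n_steps)) * step
print(left, right, peak, area)
```

Increasing σ lowers the peak and widens the curve, while changing μ only shifts it left or right.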
Types of Distributions
Here is a list of distribution types:
• Bernoulli Distribution
• Uniform Distribution
• Binomial Distribution
• Normal or Gaussian Distribution
• Exponential Distribution
• Poisson Distribution
[Figure: a bunch of continuous density functions (aka probability distributions)]
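To get a feel for the distributions listed above, one can draw samples from each and compare the sample mean with the theoretical mean. The sketch uses only Python's `random` module; the parameter values are arbitrary, and the binomial and Poisson samplers are simple hand-written helpers (a sum of Bernoulli trials and Knuth's multiplication method), not library functions.

```python
import math
import random

rng = random.Random(123)

def bernoulli(p):
    """1 with probability p, otherwise 0."""
    return 1 if rng.random() < p else 0

def binomial(n, p):
    """Number of successes in n independent Bernoulli(p) trials."""
    return sum(bernoulli(p) for _ in range(n))

def poisson(lam):
    """Knuth's method: multiply uniforms until the product drops to e^-lam."""
    threshold = math.exp(-lam)
    k, product = 0, rng.random()
    while product > threshold:
        k += 1
        product *= rng.random()
    return k

# Each entry: (function that draws one sample, theoretical mean).
samplers = {
    "Bernoulli(p=0.3)":      (lambda: bernoulli(0.3), 0.3),
    "Uniform(0, 10)":        (lambda: rng.uniform(0, 10), 5.0),
    "Binomial(n=10, p=0.3)": (lambda: binomial(10, 0.3), 3.0),
    "Normal(mu=5, sigma=2)": (lambda: rng.gauss(5, 2), 5.0),
    "Exponential(rate=0.5)": (lambda: rng.expovariate(0.5), 2.0),
    "Poisson(lam=4)":        (lambda: poisson(4), 4.0),
}

n_samples = 20000
sample_means = {}
for name, (draw, expected_mean) in samplers.items():
    sample_means[name] = sum(draw() for _ in range(n_samples)) / n_samples
    print(f"{name}: sample mean {sample_means[name]:.3f}, "
          f"theoretical mean {expected_mean}")
```

Bernoulli, binomial, and Poisson are discrete distributions, while uniform (as used here), normal, and exponential are continuous.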
Fitting a model

• Fitting a model means that you estimate the parameters of the model using the observed data.
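As a small example of fitting, suppose the model is a normal distribution. Its parameters μ and σ can then be estimated from the observed data by the sample mean and the (population) standard deviation. The data here are simulated with "true" values of 70 and 10, which are assumptions chosen so the estimates can be checked.

```python
import random
import statistics

# Simulated observations; in practice these would be the collected data.
rng = random.Random(7)
observed = [rng.gauss(70.0, 10.0) for _ in range(5000)]

# Estimate the two parameters of the normal model from the data.
mu_hat = statistics.fmean(observed)       # estimate of the mean
sigma_hat = statistics.pstdev(observed)   # estimate of the spread
print(f"estimated mu = {mu_hat:.2f}, estimated sigma = {sigma_hat:.2f}")
```

The estimates will not match the true values exactly, but they tend to get closer as the number of observations grows.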
Overfitting

• Overfitting is the term used to mean that you used a dataset to estimate the parameters of your model, but your model isn’t that good at capturing reality beyond your sampled data.
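A deliberately extreme sketch of overfitting: a "memorizer" that simply returns the y value of the closest training point fits the training sample perfectly but does worse on fresh data from the same process. The data-generating rule (y = 2x + 1 plus noise), the sample sizes, and both models are assumptions made for illustration.

```python
import random

rng = random.Random(1)

def make_data(n):
    """Draw n noisy points from the hypothetical process y = 2x + 1 + noise."""
    xs = [rng.uniform(0, 10) for _ in range(n)]
    return [(x, 2 * x + 1 + rng.gauss(0, 2)) for x in xs]

train, test = make_data(30), make_data(30)

def memorizer(x):
    """Overfit model: return the y of the nearest training point."""
    return min(train, key=lambda point: abs(point[0] - x))[1]

def fit_line(data):
    """Simple model: least-squares straight line through the data."""
    n = len(data)
    mean_x = sum(x for x, _ in data) / n
    mean_y = sum(y for _, y in data) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in data)
             / sum((x - mean_x) ** 2 for x, _ in data))
    intercept = mean_y - slope * mean_x
    return lambda x: slope * x + intercept

def mse(model, data):
    """Mean squared error of a model on a dataset."""
    return sum((model(x) - y) ** 2 for x, y in data) / len(data)

line = fit_line(train)
print("memorizer: train", mse(memorizer, train), "test", mse(memorizer, test))
print("line:      train", mse(line, train), "test", mse(line, test))
```

The memorizer's training error is zero because it reproduces the noise in the sample, which is exactly what fails to carry over to unseen data. The straight line typically shows a much smaller gap between training and test error.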
