Module 1 PPT

DATA SCIENCE AND

VISUALIZATION
18CS644
Why NBA?
• Program Accreditation
• Washington Accord
• Branding & Bragging
• Stakeholder Perception
• No escape from statutory and regulatory bodies like AICTE, MCI, etc., irrespective of institutional rankings or status
• Required for AICTE Approval, Extension of Approval, new programs and seat increase

28/04/2025 Department of CSE, SVCE, Bengaluru - 562157 2


Institute Vision and Mission
Our Vision
• To be a premier institute for addressing the challenges in global
perspective.
Our Mission
• M1: Nurture students with a professional and ethical outlook to identify needs, analyze, design and innovate sustainable solutions through lifelong learning in service of society, as individuals or in teams.
• M2: Establish state-of-the-art Laboratories and Information
Resource centre for education and research.
• M3: Collaborate with Industry, Government Organization and
Society to align the curriculum and outreach activities.
Department Vision and Mission
Our Vision
• To be a school of Excellence in Computing for Holistic Education and Research
Our Mission
• M1: Accomplish academic achievement in Computer Science and Engineering
through student-centered creative teaching learning, qualified faculty members,
assessment and effective usage of ICT.
• M2: Establish a Center of Excellence in various verticals of computer science and engineering to encourage collaborative research and industry-institute interaction.
• M3: Transform engineering students into socially responsible, ethical, technically competent, value-added professionals or entrepreneurs through holistic education.
Program Educational Objectives
• Knowledge: Computer Science and Engineering graduates will have professional technical careers in interdisciplinary domains, providing innovative and sustainable solutions using modern tools.
• Skills: Computer Science and Engineering Graduates will have effective
communication, leadership, team building, problem solving, decision making
and creative skills.
• Attitude: Computer Science and Engineering Graduates will practice ethical
responsibilities towards their peers, employers and society.



Program Specific Outcomes
• PSO 1: Ability to adapt quickly to any domain, interact with diverse groups of individuals, and be an entrepreneur in a societal and global setting.

• PSO 2: Ability to visualize the operations of existing and future software applications.


Program Outcomes
• Engineering knowledge: Apply the knowledge of mathematics, science,
engineering fundamentals, and an engineering specialization to the solution of
complex engineering problems.
• Problem analysis: Identify, formulate, review research literature, and analyze
complex engineering problems reaching substantiated conclusions using first
principles of mathematics, natural sciences, and engineering sciences.
• Design/development of solutions: Design solutions for complex engineering
problems and design system components or processes that meet the specified
needs with appropriate consideration for the public health and safety, and the
cultural, societal, and environmental considerations.



Program Outcomes
• Conduct investigations of complex problems: Use research-based
knowledge and research methods including design of experiments,
analysis and interpretation of data, and synthesis of the information to
provide valid conclusions.
• Modern tool usage: Create, select, and apply appropriate techniques,
resources, and modern engineering and IT tools including prediction and
modeling to complex engineering activities with an understanding of the
limitations.
• The engineer and society: Apply reasoning informed by the contextual
knowledge to assess societal, health, safety, legal and cultural issues and
the consequent responsibilities relevant to the professional engineering
practice.



Program Outcomes
• Environment and sustainability: Understand the impact of the
professional engineering solutions in societal and environmental
contexts, and demonstrate the knowledge of, and need for
sustainable development.
• Ethics: Apply ethical principles and commit to professional ethics
and responsibilities and norms of the engineering practice.
• Individual and team work: Function effectively as an individual, and
as a member or leader in diverse teams, and in multidisciplinary
settings.



Program Outcomes
• Communication: Communicate effectively on complex
engineering activities with the engineering community and with
society at large, such as, being able to comprehend and write
effective reports and design documentation, make effective
presentations, and give and receive clear instructions.
• Project management and finance: Demonstrate knowledge and
understanding of the engineering and management principles and
apply these to one’s own work, as a member and leader in a team,
to manage projects and in multidisciplinary environments.
• Life-long learning: Recognize the need for, and have the
preparation and ability to engage in independent and life-long
learning in the broadest context of technological change.
Course Outcomes
At the end of the course the student will be able to:
CO 1. Understand the data in different forms
CO 2. Apply different techniques to Explore Data Analysis and the
Data Science Process.
CO 3. Analyze feature selection algorithms & design a recommender
system.
CO 4. Evaluate data visualization tools and libraries and plot graphs.
CO 5. Develop different charts and include mathematical expressions.
Syllabus
Module-1
Introduction to Data Science
Introduction: What is Data Science? Big Data and Data Science hype – and getting past the hype, Why now? – Datafication, Current landscape of perspectives, Skill sets needed. Statistical Inference: Populations and samples, Statistical modelling, probability distributions, fitting a model.
Module-2
Exploratory Data Analysis and the Data Science Process
Basic tools (plots, graphs and summary statistics) of EDA, Philosophy of EDA, The Data Science Process, Case Study: RealDirect (online real estate firm). Three Basic Machine Learning Algorithms: Linear Regression, k-Nearest Neighbours (k-NN), k-means.
Module-3
Feature Generation and Feature Selection
Extracting Meaning from Data: Motivating application: user (customer) retention. Feature
Generation (brainstorming, role of domain expertise, and place for imagination), Feature
Selection algorithms. Filters; Wrappers; Decision Trees; Random Forests.
Recommendation Systems: Building a User-Facing Data Product, Algorithmic ingredients
of a Recommendation Engine, Dimensionality Reduction, Singular Value Decomposition,
Principal Component Analysis, Exercise: build your own recommendation system.
Module-4
Data Visualization and Data Exploration
Introduction: Data Visualization, Importance of Data Visualization, Data Wrangling, Tools and Libraries for Visualization; Comparison Plots: Line Chart, Bar Chart and Radar Chart; Relation Plots: Scatter Plot, Bubble Plot, Correlogram and Heatmap; Composition Plots: Pie Chart, Stacked Bar Chart, Stacked Area Chart, Venn Diagram; Distribution Plots: Histogram, Density Plot, Box Plot, Violin Plot; Geo Plots: Dot Map, Choropleth Map, Connection Map; What Makes a Good Visualization?
Module-5
A Deep Dive into Matplotlib
Introduction, Overview of Plots in Matplotlib, Pyplot Basics: Creating Figures, Closing Figures, Format Strings, Plotting, Plotting Using pandas DataFrames, Displaying Figures, Saving Figures; Basic Text and Legend Functions: Labels, Titles, Text, Annotations, Legends; Basic Plots: Bar Chart, Pie Chart, Stacked Bar Chart, Stacked Area Chart, Histogram, Box Plot, Scatter Plot, Bubble Plot; Layouts: Subplots, Tight Layout, Radar Charts, GridSpec; Images: Basic Image Operations, Writing Mathematical Expressions
Suggested Learning Resources
Textbooks
1. Doing Data Science, Cathy O'Neil and Rachel Schutt, O'Reilly Media, Inc., 2013
2. Data Visualization Workshop, Tim Grobmann and Mario Dobler, Packt Publishing, ISBN 9781800568112
References:
1. Mining of Massive Datasets, Anand Rajaraman and Jeffrey D. Ullman, Cambridge University Press, 2010
2. Data Science from Scratch, Joel Grus, Shroff Publisher/O'Reilly Media
Module-1
Introduction to Data Science
• Data science is like being a detective, but instead of solving crimes, you're solving puzzles hidden in data.
• Imagine you have a massive collection of puzzle pieces scattered all over the place. Each piece represents a different aspect of information, like numbers, words, or images.
Key Components of Data Science:
• Data Collection:
• Just like a detective gathers clues, data scientists collect different types of
information from various sources. This could be anything from customer
reviews, sensor readings, social media posts, or even weather data.

• Data Cleaning and Preparation:
• Once we have our puzzle pieces, we need to clean and organize them. Sometimes pieces are missing or don't fit quite right, so we have to tidy up the data to make sense of it. Think of this as sorting through the puzzle pieces, discarding the ones that don't belong, and making sure everything is in the right order.
• Data Analysis:
• Now that we have our clean and organized data, it's time to analyze it. This
is where we start putting the puzzle together. We look for patterns, trends,
and insights that can help us understand the story behind the data.

• Data Visualization:
• Data visualization is like creating a picture of the solved puzzle. Instead of
just looking at rows and columns of numbers, we use charts, graphs, and
other visual tools to make the information easier to understand and interpret.
Examples:

• Netflix Recommendations:
• When Netflix suggests a movie or show you might like based on
what you've watched before, that's data science in action. It's
analyzing your viewing history (data), finding patterns, and
making predictions about what you might enjoy next.
• Predicting Weather:
• Weather forecasting relies on collecting and analyzing
data from various sources like satellites, weather
stations, and sensors. By crunching this data,
meteorologists can predict things like temperature,
precipitation, and storms.
Healthcare Analytics:
• Hospitals use data science to improve patient care and outcomes. For example, analyzing patient records can help identify trends in diseases, predict potential outbreaks, or personalize treatment plans.
What is Data & Science?
• The term “data science” combines two key elements: “data” and
“science.”
Data:
• It refers to the raw information that is collected, stored, and
processed.
• In today’s digital age, enormous amounts of data are generated
from various sources such as sensors, social media, transactions,
and more.
• This data can come in structured formats (e.g., databases) or
unstructured formats (e.g., text, images, videos).
Science:
• It refers to the systematic study and investigation of phenomena using scientific methods and principles.
• Science involves forming hypotheses, conducting experiments, analyzing data, and drawing conclusions based on evidence.
What is the difference between DS and ML?
• Data science provides the framework and tools for extracting insights from data, while machine learning is a subset of data science that focuses on developing algorithms for automated learning and prediction.
Big Data and Data
Science Hype
This topic covers some common concerns and misconceptions surrounding data science, especially regarding "Big Data" and its relationship with traditional research fields like statistics.
• Lack of Clear Definitions: People often use terms like "Big Data" and "data
science" without clear definitions, making them seem meaningless or confusing.

• Lack of Respect for Traditional Researchers: There's a disregard for the decades
of work done by researchers in various fields like statistics, computer science,
and engineering, who laid the groundwork for modern data science.

• Excessive Hype: There's an exaggeration and hype around data science, leading to
unrealistic expectations and making it harder to see its real value.

• Overlap with Statistics: Data science is sometimes seen as just a rebranding of statistics, which can be frustrating for statisticians who feel their field is being overlooked.
• In simpler terms, the passage is saying that people often
talk about data science without really understanding what
it means, they don't appreciate the work that went into it
before, they hype it up too much, and they overlook the
connection with traditional fields like statistics.
Getting Past the Hype
• Rachel's Experience: Summary of Rachel's experience transitioning
from studying statistics to working at Google.
• Quote from Rachel: "It was clear to me pretty quickly that the stuff I was working on at Google was different than anything I had learned at school when I got my PhD in statistics."
• Rachel's investigation into data science through meetings and teaching
a course at Columbia aimed to clarify the emerging field's meaning
and significance.
• Ultimately, the goal of the book is to help more people understand
the reality of data science.
Datafication
• Datafication refers to the process of converting various aspects of human life and activities into digital data that can be stored, analyzed, and utilized for various purposes.
• This includes transforming behaviors, interactions, and transactions, both online and offline, into quantifiable data points.
• Datafication enables the collection, processing, and
interpretation of vast amounts of information, often with
the aim of gaining insights, making predictions, and
driving decision-making in diverse fields such as business,
healthcare, education, and governance.
Example
• For instance, when a user "likes" a post on Facebook, this action is recorded as
data, contributing to the user's profile and providing insights into their
preferences and interests.

• Over time, as users continue to engage with the platform, their data
profiles become increasingly detailed and comprehensive.

• This data can then be analyzed and used by the platform to personalize user
experiences, recommend content, target advertisements, and optimize
engagement.
The Current Landscape (with a Little History)
What is data science? Is it new, or is it just statistics or analytics rebranded? Is it real, or is it pure hype? And if it's new and if it's real, what does that mean?
What is Data Science?
 Data science is using data to answer questions.
 Data science is the science of analyzing raw data using
statistics and machine learning techniques with the
purpose of drawing conclusions about that
information.
Drew Conway's Venn diagram of data science from 2010
Data Science involves:
• Statistics, computer science,
mathematics
• Data cleaning and
formatting
• Data visualization
• Hacking skills refer to the ability to manipulate data efficiently
using programming languages and tools.
• Mathematical and statistical knowledge is essential for analyzing
and deriving insights from the data collected.
• Substantive expertise involves having a deep understanding of
the domain or field in which the data is being analyzed.
• These three components intersect to form the core of data
science, where successful practitioners can collect, clean,
analyze, and interpret data effectively.
• However, there is a danger zone where individuals may possess
hacking skills and substantive expertise but lack
understanding of mathematical and statistical concepts,
potentially leading to misleading or misinterpreted analysis
results.
Data Science Jobs
• The paragraph highlights the growing demand for data scientists, with Columbia University establishing an Institute for Data Sciences and Engineering and numerous job openings in New York City alone.
• It acknowledges that data science encompasses a wide range of skills, including computer science, statistics, communication, data visualization, and domain expertise.
• However, it emphasizes that it's unrealistic for one individual to be an expert in all these areas, suggesting that building teams with diverse skill sets is more effective.
• This approach allows teams to specialize in different aspects of data science and work together more effectively.
Skill Sets
• It is important to develop a diverse skill set as a data scientist, particularly focusing on statistical thinking in the age of Big Data.
• While foundational knowledge in statistics, linear algebra, and programming is crucial, data scientists also need to develop parallel skill sets in data preparation, modeling, coding, visualization, and communication.
• These skills are interdependent and essential for effectively working with data.
Statistical Thinking
Example: Analyzing Students' Performance in the AI and ML Course.
(Only for your understanding; don't write this in the exam.)
• Imagine I want to analyze how well the 5th-semester students performed in the AI and ML course to improve my current Data Science and Visualization teaching strategies. Here's how I can apply statistical thinking:
1. Data Collection
2. Descriptive Statistics
3. Data Visualization
4. Inferential Statistics
5. Correlation Analysis
6. Making Conclusions
7. Improving Teaching Strategies
8. Feedback Loop
Data Collection
• Gather the final grades of all students from the AI and ML course last semester.
Descriptive Statistics
• Mean: Calculate the average grade to understand the overall performance level
of the class.

• Median: Identify the middle grade to get a sense of the central tendency without
extreme grades skewing the result.

• Mode: Determine the most frequently occurring grade, which can indicate a
common performance level.

• Range: Find the difference between the highest and lowest grades to understand
the spread of student performance.
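As a sketch, the four descriptive statistics above can be computed with Python's built-in statistics module; the grade list below is hypothetical, purely for illustration:

```python
import statistics

# Hypothetical final grades for last semester's AI and ML course
grades = [72, 85, 91, 64, 85, 78, 55, 85, 90, 70]

mean = statistics.mean(grades)      # overall performance level
median = statistics.median(grades)  # central tendency, robust to extremes
mode = statistics.mode(grades)      # most frequently occurring grade
spread = max(grades) - min(grades)  # range: highest minus lowest grade

print(mean, median, mode, spread)
```

Note how the median (81.5) sits above the mean (77.5) here: one very low grade pulls the mean down without moving the median.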
Data Visualization
• Histogram: Create a histogram of the grades to visualize the distribution. This can show how grades are spread across different ranges.
• Box Plot: Use a box plot to display the median, quartiles, and any potential outliers in the grades.
Inferential Statistics
• Standard Deviation: Calculate the standard deviation of the grades to understand the variability. A low standard deviation means grades are close to the mean, while a high standard deviation indicates more variation.
• Z-scores: Convert grades to z-scores to see how many standard deviations each grade is from the mean, helping identify significantly high or low performers.
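A minimal sketch of the z-score idea, reusing the same hypothetical grades; the 1.5-standard-deviation cutoff is an arbitrary choice for illustration:

```python
import statistics

grades = [72, 85, 91, 64, 85, 78, 55, 85, 90, 70]  # illustrative data
mu = statistics.mean(grades)       # class mean
sigma = statistics.pstdev(grades)  # population standard deviation

# z-score: how many standard deviations each grade lies from the mean
z_scores = [(g - mu) / sigma for g in grades]

# Flag significantly low performers: more than 1.5 SDs below the mean
low_performers = [g for g, z in zip(grades, z_scores) if z < -1.5]
print(low_performers)
```

Only the grade of 55 falls below the cutoff with this data, which matches the intuition that it is the one clear outlier.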
Correlation Analysis
• Correlation with Attendance: Analyze if there is a correlation between students' attendance and their grades. This can help you understand if regular attendance is a significant factor in student performance.
• Correlation with Assignments: Check the correlation between students' performance in assignments and their final grades. This can indicate if consistent performance throughout the course impacts the final result.
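The Pearson correlation coefficient behind both analyses can be computed from first principles; the attendance and grade figures below are invented for illustration:

```python
import math

# Illustrative data: attendance percentage and final grade for six students
attendance = [60, 70, 75, 80, 90, 95]
grades     = [55, 64, 70, 72, 85, 91]

def pearson(x, y):
    """Pearson correlation coefficient, computed from first principles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

r = pearson(attendance, grades)
print(round(r, 3))
```

A value of r close to +1, as with this data, suggests attendance and grades rise together; close to 0 would mean no linear relationship.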
Making Conclusions
• Identify trends, such as whether most students scored within a certain range or if there are outliers who performed exceptionally well or poorly.
• Determine if there are specific topics where students struggled, based on clustering of lower grades around certain assignments or exam questions.
Improving Teaching Strategies:
• Based on your analysis, decide if you need to adjust your teaching methods, such as providing additional resources or focusing more on difficult topics.
• Offer additional support or office hours for students who are identified as outliers with low performance to help them catch up.
Feedback Loop:
• Use the insights gained to give personalized feedback to students, highlighting their strengths and areas for improvement.
• Incorporate findings into your current Data Science and Visualization course to enhance teaching effectiveness and student learning outcomes.
Statistical thinking
• Statistical thinking is the process of using data to understand the world,
make decisions, and solve problems.

• It involves:

1.Collecting Data: Gathering information from various sources.

2.Analyzing Data: Looking at the numbers to find patterns and trends.

3.Interpreting Data: Figuring out what the patterns and trends mean.

4.Making Decisions: Using the insights from the data to make informed choices.
Note:
• Statistical thinking is a way of understanding a
complex world by describing it in relatively
simple terms that nonetheless capture essential
aspects of its structure or function, and that also
provide us some idea of how uncertain we are
about that knowledge.
Statistical Inference
• Statistical inference is the process of drawing conclusions or making predictions about a population based on data collected from a sample of that population.
In Layman's Terms
• Statistical inference is a way of making educated guesses about a large group based on a smaller sample of data from that group.
Populations and Samples
• In statistics, understanding the difference between a population and a sample is fundamental to many aspects of data analysis and inference.
Population Vs Sample
Population
• In statistics, the population is the entire set of items from which data is drawn in the statistical study.
• It can be a group of individuals or a set of items.
• The population is usually denoted by N.
Sample
• A sample is a subset of the population selected for study.
• It is a representative portion of the population from which we collect data in order to make inferences or draw conclusions about the entire population.
• It is denoted by n.
Population Vs Sample

Population: The population includes all members of a specified group. Collecting data from an entire population can be time-consuming, expensive, and sometimes impractical or impossible. Example: all residents in the city.

Sample: A sample is a subset of the population. Samples offer a more feasible approach to studying populations, allowing researchers to draw conclusions based on smaller, manageable datasets. Example: 1,000 households, a subset of the entire population.
Populations and Samples of Big Data
Populations
• Definition: A population is the entire group of items, individuals, or events that you're interested in studying.
• Example: Imagine you want to understand the habits of all the users on a social media platform. Here, all the users of the platform are your population.
• Big Data Context: In big data, a population can be extremely large. For example, it could include all the tweets made on Twitter, all the transactions made on Amazon, or all the rides taken using a ride-sharing app.
Samples
• Definition: A sample is a smaller group selected from the population. It's like
taking a small piece to understand the whole.

• Example: Instead of analyzing all the tweets ever made (which is the population),
you might look at a random selection of 10,000 tweets (this is your sample).

• Big Data Context: Even though big data technology can handle massive amounts
of information, analyzing a sample can still be useful. It allows you to make
inferences about the entire population without needing to process all the data,
which can be time-consuming and expensive.
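A quick sketch of why sampling works, using only Python's standard library; the "population" of user spending figures below is simulated, not real data:

```python
import random
import statistics

random.seed(42)  # fixed seed so the sketch is reproducible

# Hypothetical population: monthly spend (in dollars) of 100,000 users
population = [random.gauss(50, 12) for _ in range(100_000)]

# A simple random sample of 1,000 users is far cheaper to analyze
sample = random.sample(population, 1_000)

pop_mean = statistics.mean(population)
sample_mean = statistics.mean(sample)
print(pop_mean, sample_mean)
```

The sample mean lands very close to the population mean while touching only 1% of the data, which is exactly the efficiency argument made above.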
Summary
• Even though we can record all data, we still use samples to make data handling
easier and to draw accurate conclusions.

• Big companies like Google often use samples because it's more efficient.

• The data we collect might not always represent the whole picture.

• For example, tweets during Hurricane Sandy (Hurricane Sandy was a powerful and
destructive storm that affected the Caribbean and the eastern United States in late
October 2012. It was one of the most significant hurricanes in recent history due to its
size, impact, and the damage it caused) mostly came from New Yorkers, giving a
skewed view of the event.
• Sampling helps avoid biases and makes data more manageable. Also,
even if we have all data from a company, it's still just one version of
what could have happened, and we use it to understand the bigger
picture of behaviors or trends.

• It's essential to remember that any conclusions drawn from samples


might not apply to the entire population without further investigation
and understanding.
Statistical Thinking vs Statistical Inference
Statistical Thinking
• Definition: Statistical thinking is the overall mindset and approach to understanding, analyzing, and interpreting data to make informed decisions.
Statistical Inference
• Definition: Statistical inference is a specific aspect of statistical thinking that involves making predictions or generalizations about a larger population based on a sample of data.
Important Note on SAMPLE
• Selecting a good sample is important because it ensures your research results are accurate and truly represent the whole group you're studying.
Modelling
• Modeling refers to the process of creating simplified representations of complex systems or phenomena to aid understanding, prediction, or decision-making.
Model
• In simple terms, a model in data science is like a tool or a recipe.

• Imagine you're trying to bake a cake. You follow a recipe that tells you
what ingredients to use and how to mix them together.

• In data science, a model is similar—it's a set of rules or equations


that we use to understand or predict things based on data.
What is a model?
• A model is our attempt to understand and represent the nature of reality through a particular lens, be it architectural, biological, or mathematical.
• A model is an artificial construction where all extraneous detail has been removed or abstracted.
Statistical modelling
• Before you get too involved with the data and start coding, it's useful to draw a picture of what you think the underlying process might be with your model.
• Statistical modeling is like making educated guesses about the world around us using math and data.
• In mathematical expressions, the convention is to use Greek letters for parameters and Latin letters for data. So, for example, if you have two columns of data, x and y, and you think there's a linear relationship, you'd write down y = β0 + β1x.
• You don't know what β0 and β1 are in terms of actual numbers yet, so they're the parameters.
• Other people prefer pictures and will first draw a diagram of data flow, possibly with arrows, showing how things affect other things or what happens over time.
• This gives them an abstract picture of the relationships before choosing equations to express them.
How do you build a model?
• One place to start is exploratory data analysis (EDA), which we will cover in a later section.
• This entails making plots and building intuition for your particular dataset. EDA helps a lot, as well as trial and error and iteration.
Data Science Modelling Steps
1. Define Your Objective
2. Collect Data
3. Clean Your Data
4. Explore Your Data
5. Split Your Data
6. Choose a Model
7. Train Your Model
8. Evaluate Your Model
9. Improve Your Model
10. Deploy Your Model
Define Your Objective
• First, define very clearly what problem you are going to solve.
• Whether that is customer churn prediction (which customers are likely to stop using a company's products or services within a given period), better product recommendations, or finding patterns in data, you first need to know your direction.
• This should bring clarity to the choice of data, algorithms, and evaluation metrics.
Collect Data
• Gather data relevant to your objective.
• This can include internal data from your company, publicly available datasets, or data purchased from external sources.
• Ensure you have enough data to train your model effectively.
Clean Your Data
• Data cleaning is a critical step to prepare your dataset for modeling.
• It involves handling missing values, removing duplicates, and correcting errors.
• Clean data ensures the reliability of your model's predictions.
Explore Your Data

• Data exploration, or exploratory data analysis (EDA), involves summarizing the main characteristics of your dataset.
• Use visualizations and statistics to uncover patterns, anomalies, and relationships between variables.
Split Your Data
• Divide your dataset into training and testing sets.
• The training set is used to train your model, while the testing set evaluates its performance.
• A common split ratio is 80% for training and 20% for testing.
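The 80/20 split described above can be sketched in a few lines of Python; the dataset here is a hypothetical list of (feature, label) pairs:

```python
import random

random.seed(0)  # fixed seed so the sketch is reproducible

# Hypothetical dataset of (feature, label) pairs
data = [(i, i % 2) for i in range(100)]

random.shuffle(data)          # shuffle first, so the split isn't biased by ordering
split = int(len(data) * 0.8)  # 80% for training
train, test = data[:split], data[split:]
print(len(train), len(test))
```

Shuffling before splitting matters: if the data were sorted by date or label, a naive slice would give the model a training set that doesn't resemble the test set.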
Choose a Model
• Select a model that suits your problem type (e.g., regression, classification) and data.
• Beginners can start with simpler models like linear regression or decision trees before moving on to more complex models like neural networks.
Train Your Model
• Feed your training data into the model.
• This process involves the model learning from the data, adjusting its parameters to minimize errors.
• Training a model can take time, especially with large datasets or complex models.
Evaluate Your Model
• After training, assess your model's performance using the testing set.
• Common evaluation metrics include accuracy, precision, recall, and F1 score.
• Evaluation helps you understand how well your model will perform on unseen data.
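A sketch of how the metrics named above are computed for a binary classifier; the true labels and predictions below are invented for illustration:

```python
# Hypothetical test-set labels and a model's predictions (binary classification)
y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0]

# Confusion-matrix counts
tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)  # true positives
fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)  # false positives
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)  # false negatives
tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)  # true negatives

accuracy = (tp + tn) / len(y_true)              # fraction of correct predictions
precision = tp / (tp + fp)                      # of predicted positives, how many were right
recall = tp / (tp + fn)                         # of actual positives, how many were found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two
print(accuracy, precision, recall, f1)
```

Precision and recall pull in opposite directions; the F1 score summarizes the trade-off in a single number.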
Improve Your Model
• Based on the evaluation, you may need to refine your model.
• This can involve tuning hyperparameters, choosing a different model, or going back to data cleaning and preparation for further improvements.
Deploy Your Model
• Once satisfied with your model's performance, deploy it for real-world use.
• This could mean integrating it into an application or using it for decision-making within your organization.
Probability distributions
• Imagine you are a data analyst, or someone building machine learning models, algorithms, or Python scripts, and you need to analyze trends.
• Still, you don't have a large enough dataset to analyze the trend.
• Through this class, let's find a way to solve this problem using probability distributions.
What is Probability Distribution?
• A probability distribution is a mathematical function that defines the
likelihood of different outcomes or values of a variable.

• This function is commonly represented by a graph or probability table,


and it provides the probabilities of various possible results of an experiment
or random phenomenon based on the sample space and the probabilities of
events.

• Probability distributions are fundamental in probability theory and


statistics for analyzing data and making predictions.
In Layman's Terms
• A probability distribution is a way to show how likely different outcomes are.
• Imagine you have a list of all possible outcomes of something random, like rolling a die or picking a card from a deck.
• The probability distribution tells you how often you can expect each outcome.
Example of Probability Distribution
• Suppose I am a teacher at a university. After checking assignments for a week, I graded all the students. I gave these graded papers to a data entry operator at the university and told him to create a spreadsheet containing the grades of all the students. But he only stored the grades, not the corresponding students.
He made another blunder; he missed a few
entries in a hurry, and we have no idea
whose grades are missing.
How to find the missing values?
• One way to find this out is by visualizing the grades and seeing if you can find a trend in the data.
• The graph you plotted is called the frequency distribution of the data.
• You see that there is a smooth curve-like structure that defines our data, but do you notice an anomaly? We have an abnormally low frequency at a particular score range.
• So the best guess would be that the missing values are those that remove the dent in the distribution.
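The frequency distribution described above can be approximated with a quick text histogram using collections.Counter; the grade list below is hypothetical, with a deliberate dip at one score standing in for the missing entries:

```python
from collections import Counter

# Hypothetical grades recovered from the spreadsheet; note the suspicious
# dip at score 7, surrounded by higher frequencies on both sides
grades = [3, 4, 4, 5, 5, 5, 6, 6, 6, 6, 7, 8, 8, 8, 9]

freq = Counter(grades)  # frequency distribution: score -> count
for score in sorted(freq):
    print(score, "#" * freq[score])  # a crude text histogram
```

The dent at score 7 is exactly the kind of anomaly the slide describes: the missing entries are most plausibly the ones that would smooth it out.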
The normal distribution is written as

p(x | μ, σ) = (1 / (σ√(2π))) · exp(−(x − μ)² / (2σ²))

The parameter μ is the mean and median and controls where the distribution is centered (because this is a symmetric distribution), and the parameter σ controls how spread out the distribution is. This is the general functional form.
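As a sanity check, the density formula for the normal distribution can be coded directly and compared against the standard library's statistics.NormalDist:

```python
import math
import statistics

def normal_pdf(x, mu, sigma):
    """Density of the normal distribution N(mu, sigma^2), written out by hand."""
    return (1.0 / (sigma * math.sqrt(2 * math.pi))) * \
        math.exp(-((x - mu) ** 2) / (2 * sigma ** 2))

mu, sigma = 0.0, 1.0

# Cross-check the hand-written formula against the standard library
assert math.isclose(normal_pdf(0.5, mu, sigma),
                    statistics.NormalDist(mu, sigma).pdf(0.5))
print(normal_pdf(0.0, mu, sigma))  # peak of the standard normal, 1/sqrt(2*pi)
```

The density peaks at x = μ and falls off symmetrically, which is why μ alone sets the center and σ alone sets the spread.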
Types of Distributions
Here is a list of distributions types
• Bernoulli Distribution
• Uniform Distribution
• Binomial Distribution
• Normal or Gaussian Distribution
• Exponential Distribution
• Poisson Distribution
A bunch of continuous density functions (aka probability
distributions)
Fitting a model
• Fitting a model means that you estimate the parameters of the model using the observed data.
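A minimal sketch of fitting the linear model y = β0 + β1x from earlier, using the standard closed-form least-squares estimates; the x and y values are invented for illustration:

```python
# Estimating the parameters beta0 and beta1 of y = beta0 + beta1*x
# by ordinary least squares, using the closed-form formulas.
x = [1, 2, 3, 4, 5]
y = [2.1, 4.0, 6.2, 7.9, 10.1]  # roughly y = 2x, with some noise

n = len(x)
mx, my = sum(x) / n, sum(y) / n  # sample means of x and y

# Slope: covariance of x and y divided by variance of x
beta1 = sum((a - mx) * (b - my) for a, b in zip(x, y)) \
    / sum((a - mx) ** 2 for a in x)
# Intercept: forces the fitted line through the point (mean x, mean y)
beta0 = my - beta1 * mx
print(beta0, beta1)
```

"Fitting" here is exactly the slide's definition: the observed (x, y) pairs turn the unknown parameters β0 and β1 into concrete numbers.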
Overfitting
• Overfitting is the term used to mean that you used a dataset to estimate the parameters of your model, but your model isn't that good at capturing reality beyond your sampled data.
