Module 1 PPT
VISUALIZATION
18CS644
Why NBA?
• Program Accreditation
• Washington Accord
• Branding & Bragging
• Stakeholder Perception
• No escape from statutory and regulatory bodies such as
AICTE and MCI, irrespective of institutional rankings
or status
• Required for AICTE Approval, Extension of Approval,
New programs and seat increase
1. Doing Data Science, Cathy O’Neil and Rachel Schutt, O'Reilly Media, Inc.,
2013
2. Data Visualization Workshop, Tim Großmann and Mario Döbler, Packt Publishing, ISBN
9781800568112
Reference:
1. Mining of Massive Datasets, Anand Rajaraman and Jeffrey D. Ullman, Cambridge University
Press, 2010
2. Data Science from Scratch, Joel Grus, Shroff Publishers / O’Reilly Media
Module-1
Introduction to Data Science
• Data science is like being a detective, but instead of solving
crimes, you're solving puzzles hidden in data.
• Data Visualization:
• Data visualization is like creating a picture of the solved puzzle. Instead of
just looking at rows and columns of numbers, we use charts, graphs, and
other visual tools to make the information easier to understand and interpret.
Examples:
• Netflix Recommendations:
• When Netflix suggests a movie or show you might like based on
what you've watched before, that's data science in action. It's
analyzing your viewing history (data), finding patterns, and
making predictions about what you might enjoy next.
• Predicting Weather:
• Weather forecasting relies on collecting and analyzing
data from various sources like satellites, weather
stations, and sensors. By crunching this data,
meteorologists can predict things like temperature,
precipitation, and storms.
Healthcare Analytics:
• Lack of Respect for Traditional Researchers: There's a disregard for the decades
of work done by researchers in various fields like statistics, computer science,
and engineering, who laid the groundwork for modern data science.
• Excessive Hype: There's an exaggeration and hype around data science, leading to
unrealistic expectations and making it harder to see its real value.
• Over time, as users continue to engage with the platform, their data
profiles become increasingly detailed and comprehensive.
• This data can then be analyzed and used by the platform to personalize user
experiences, recommend content, target advertisements, and optimize
engagement.
The Current Landscape (with a Little History)
• However, it emphasizes that it's unrealistic for one individual to be an expert in all these
areas, suggesting that building teams with diverse skill sets is more effective.
• This approach allows teams to specialize in different aspects of data science and
Skill Sets
• It is important to develop a diverse skill set as a data scientist,
particularly focusing on statistical thinking in the age of Big Data.
• Median: Identify the middle grade to get a sense of the central tendency without
extreme grades skewing the result.
• Mode: Determine the most frequently occurring grade, which can indicate a
common performance level.
• Range: Find the difference between the highest and lowest grades to understand
the spread of student performance.
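The three measures above can be sketched with Python's standard library. The grades list is a made-up example for illustration, not data from the slides.

```python
import statistics

# Hypothetical grades, assumed purely for illustration.
grades = [55, 62, 70, 70, 78, 85, 98]

median_grade = statistics.median(grades)   # middle value once the grades are sorted
mode_grade = statistics.mode(grades)       # most frequently occurring grade
grade_range = max(grades) - min(grades)    # highest grade minus lowest grade

print("median:", median_grade)
print("mode:", mode_grade)
print("range:", grade_range)
```

Note that one very high grade (98) widens the range but does not move the median, which is why the median is described as robust to extreme grades.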
Data Visualization
• Box Plot: Use a box plot to display the median, quartiles, and
any potential outliers in the grades.
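A box plot is drawn from a handful of numbers, which can be computed before any plotting. The sketch below assumes made-up grades and the common 1.5 × IQR rule for flagging outliers (a convention, not something stated in the slides). Drawing the plot itself would need a plotting library, which is not shown here.

```python
import statistics

# Hypothetical grades, assumed purely for illustration.
grades = [20, 60, 65, 70, 72, 75, 80, 85, 88]

# Quartiles: the three cut points that split the sorted data into four parts.
q1, median, q3 = statistics.quantiles(grades, n=4, method="inclusive")
iqr = q3 - q1  # interquartile range: the height of the "box"

# Assumed convention: points beyond 1.5 * IQR from the box are outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [g for g in grades if g < lower_fence or g > upper_fence]

print("Q1:", q1, "median:", median, "Q3:", q3)
print("outliers:", outliers)
```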
Inferential Statistics
• Standard Deviation: Calculate the standard deviation of the grades to
understand the variability. A low standard deviation means grades are close
to the mean, while a high standard deviation indicates more variation.
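A small sketch of this contrast, using two made-up classes that share the same mean but differ in spread. The numbers are assumptions for illustration, and the population form of the standard deviation (pstdev) is used because each list is treated as the whole class.

```python
import statistics

# Two hypothetical classes with the same mean grade (70).
class_a = [70, 71, 69, 70, 70]    # grades clustered near the mean
class_b = [40, 100, 55, 85, 70]   # grades spread far from the mean

mean_a = statistics.mean(class_a)
mean_b = statistics.mean(class_b)

# Population standard deviation: square root of the average squared deviation.
sd_a = statistics.pstdev(class_a)
sd_b = statistics.pstdev(class_b)

print(f"class A: mean={mean_a}, sd={sd_a:.2f}")
print(f"class B: mean={mean_b}, sd={sd_b:.2f}")
```

Both classes report the same mean, so only the standard deviation reveals that class B is far more variable.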
• It involves:
3. Interpreting Data: Figuring out what the patterns and trends mean.
4. Making Decisions: Using the insights from the data to make informed choices.
Note:
• Statistical thinking is a way of understanding a
complex world by describing it in relatively
simple terms that nonetheless capture essential
aspects of its structure or function, and that also
provide us some idea of how uncertain we are
about that knowledge.
Statistical Inference
• It is denoted by n.
Population vs. Sample
• Population: The population includes all members of a specified group.
Collecting data from an entire population can be time-consuming, expensive,
and sometimes impractical or impossible.
• Sample: A sample is a subset of the population. Samples offer a more
feasible approach to studying populations, allowing researchers to draw
conclusions based on smaller, manageable datasets.
• Example: Imagine you want to understand the habits of all the users on a
social media platform. Here, all the users of the platform are your population.
• Big Data Context: In big data, a population can be extremely large. For
example, it could include all the tweets made on Twitter, all the transactions
made on Amazon, or all the rides taken using a ride-sharing app.
Samples
• Definition: A sample is a smaller group selected from the population. It's like
taking a small piece to understand the whole.
• Example: Instead of analyzing all the tweets ever made (which is the population),
you might look at a random selection of 10,000 tweets (this is your sample).
• Big Data Context: Even though big data technology can handle massive amounts
of information, analyzing a sample can still be useful. It allows you to make
inferences about the entire population without needing to process all the data,
which can be time-consuming and expensive.
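The idea of inferring a population value from a sample can be sketched as follows. The population here is simulated (random tweet lengths), since no real tweet data is available. The sizes, the seed, and the variable names are all assumptions for illustration.

```python
import random
import statistics

rng = random.Random(42)  # fixed seed so the sketch is repeatable

# Simulated population: character counts of 200,000 made-up tweets.
population = [rng.randint(1, 280) for _ in range(200_000)]

# Random sample of 10,000 tweets, drawn without replacement.
sample = rng.sample(population, 10_000)

population_mean = statistics.fmean(population)
sample_mean = statistics.fmean(sample)

print(f"population mean length: {population_mean:.2f}")
print(f"sample mean length:     {sample_mean:.2f}")
```

The sample mean lands close to the population mean even though only 5% of the data was processed. This holds because the sample is random. A sample picked in a biased way would not give the same guarantee.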
Summary
• Even though we can record all data, we still use samples to make data handling
easier and to draw accurate conclusions.
• Big companies like Google often use samples because it's more efficient.
• The data we collect might not always represent the whole picture.
• For example, tweets during Hurricane Sandy mostly came from New Yorkers, giving a
skewed view of the event. (Hurricane Sandy was a powerful and destructive storm that
affected the Caribbean and the eastern United States in late October 2012. It was one
of the most significant hurricanes in recent history due to its size, impact, and the
damage it caused.)
• Careful sampling helps reduce bias and makes data more manageable. Also,
even if we have all the data from a company, it's still just one version of
what could have happened, and we use it to understand the bigger
picture of behaviors or trends.
Statistical Inference
• Imagine you're trying to bake a cake. You follow a recipe that tells you
what ingredients to use and how to mix them together.
• Before you get too involved with the data and start coding,
it's useful to draw a picture of what you think the underlying
process might be with your model.
• This should bring clarity to the choice of data, algorithms, and evaluation
metrics.
Collect Data
• Gather data relevant to your objective.
• The training set is used to train your model, while the testing
set evaluates its performance.
• A common split ratio is 80% for training and 20% for testing.
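A minimal sketch of an 80/20 split using only the standard library. The function name train_test_split, its parameters, and the toy records are assumptions for illustration. Libraries such as scikit-learn offer their own version of this step.

```python
import random

def train_test_split(records, train_fraction=0.8, seed=0):
    """Shuffle a copy of the records, then cut it into training and testing sets."""
    shuffled = list(records)               # copy so the caller's data is untouched
    random.Random(seed).shuffle(shuffled)  # shuffle to avoid any ordering effects
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

# Hypothetical dataset: 100 record ids.
records = list(range(100))
train_set, test_set = train_test_split(records)

print("training records:", len(train_set))
print("testing records: ", len(test_set))
```

Shuffling before the cut matters: if the data is sorted (for example by date), taking the first 80% as-is would give a training set that looks different from the testing set.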
Choose a Model
further improvements.
Deploy Your Model
• Still, you do not have enough data to analyze the trend in your
dataset.
• Through this class, let’s find a way to solve this problem using
probability distribution.
What is Probability Distribution?
• A probability distribution is a mathematical function that defines the
likelihood of different outcomes or values of a variable.
• The probability distribution tells you how often you can expect
each outcome.
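As a concrete illustration, the sketch below builds the exact probability distribution for the sum of two fair six-sided dice. The dice example is an assumption chosen for illustration and does not come from the slides.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# Every equally likely outcome of rolling two fair dice: 36 pairs in total.
totals = [a + b for a, b in product(range(1, 7), repeat=2)]
counts = Counter(totals)

# Probability of each total = (number of pairs giving it) / (number of pairs).
distribution = {
    total: Fraction(n, len(totals)) for total, n in sorted(counts.items())
}

for total, probability in distribution.items():
    print(f"P(sum = {total:2d}) = {probability}")
```

The probabilities add up to 1, and the distribution shows that a total of 7 is the most likely outcome while 2 and 12 are the least likely.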
Example of Probability Distribution
The parameter μ is the mean and median and controls where the distribution is
centered (because this is a symmetric distribution), and the parameter σ
controls how spread out the distribution is. This is the general functional form
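The description above (symmetric, centered at μ, spread controlled by σ) matches the normal (Gaussian) distribution. Assuming that is the distribution intended, its density is usually written as:

```latex
p(x \mid \mu, \sigma) = \frac{1}{\sigma \sqrt{2\pi}}
  \exp\!\left( -\frac{(x - \mu)^2}{2\sigma^2} \right)
```

Here x is the value of the variable, μ shifts the curve left or right, and a larger σ makes the curve wider and flatter.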
Types of Distributions
Here is a list of distribution types:
• Bernoulli Distribution
• Uniform Distribution
• Binomial Distribution
• Normal or Gaussian Distribution
• Exponential Distribution
• Poisson Distribution
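Several of the distributions listed above can be sampled with Python's standard random module, as sketched below. The parameter values, the seed, and the helper names bernoulli and binomial are assumptions for illustration. The Poisson distribution is left out because the standard library has no direct sampler for it.

```python
import random
import statistics

rng = random.Random(0)  # fixed seed so the sketch is repeatable
N = 20_000              # number of draws per distribution

def bernoulli(p):
    """One trial that returns 1 with probability p, otherwise 0."""
    return 1 if rng.random() < p else 0

def binomial(n, p):
    """Number of successes in n independent Bernoulli(p) trials."""
    return sum(bernoulli(p) for _ in range(n))

samples = {
    "bernoulli(p=0.3)": [bernoulli(0.3) for _ in range(N)],
    "uniform(0, 1)": [rng.uniform(0, 1) for _ in range(N)],
    "binomial(n=10, p=0.3)": [binomial(10, 0.3) for _ in range(N)],
    "normal(mu=0, sigma=1)": [rng.gauss(0, 1) for _ in range(N)],
    "exponential(rate=2)": [rng.expovariate(2.0) for _ in range(N)],
}

# The sample mean of each set of draws should sit near the theoretical mean.
means = {name: statistics.fmean(values) for name, values in samples.items()}
for name, mean in means.items():
    print(f"{name:24s} sample mean = {mean:.3f}")
```

The theoretical means are 0.3, 0.5, 3, 0, and 0.5 respectively, so the printed sample means give a quick check that each sampler behaves as expected.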
A bunch of continuous density functions (aka probability
distributions)
Fitting a model