
SUBHASHREE K

SOFTWARE TECHNICAL TRAINER


IBM

Data Visualization
UNIT I

INTRODUCTION TO STATISTICS
DATA
Data is a collection of facts, such as numbers, words, measurements,
observations, or just descriptions of things.
A collection of individual pieces of information is known as data.
Relative to today's computers and transmission media, data is
information converted into binary digital form.
Ex: A person's name and address on their own are information; the full
set of attributes about the person, such as name, phone number, address,
gender, and father's name, is data.
DATABASE
A database is an organized collection of structured information, or
data, typically stored electronically in a computer system.
The data can then be easily accessed, managed, modified, updated,
controlled, and organized.
Most databases use structured query language (SQL) for writing
and querying data.
Ex: Student Database, Employee Database
DATABASE
Examples of database languages:
Data definition language (DDL)
Data manipulation language (DML)
Data control language (DCL)
Transaction control language (TCL)
SQL
XQuery
OQL
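For illustration, here is a minimal sketch of DDL and DML statements run through Python's built-in sqlite3 module; the student table and its columns are assumptions made up for this example:

import sqlite3

# Temporary in-memory database for demonstration
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# DDL: define the structure of an assumed student table
cur.execute(
    "CREATE TABLE student (roll_no INTEGER PRIMARY KEY, name TEXT, gender TEXT)"
)

# DML: insert a row and query it back
cur.execute("INSERT INTO student VALUES (?, ?, ?)", (1, "Asha", "F"))
cur.execute("SELECT name FROM student WHERE roll_no = 1")
print(cur.fetchone())  # ('Asha',)

conn.commit()
conn.close()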
DATA CENTER
A Data center is a physical facility that organizations use to house
their critical applications and data.
A data center's design is based on a network of computing and
storage resources that enable the delivery of shared applications
and data.
The key components of a data center design include routers,
switches, firewalls, storage systems, servers, and
application-delivery controllers.
DATA WAREHOUSE
A data warehouse is a central repository of information that can be
analyzed to make more informed decisions.
Data flows into a data warehouse from transactional
systems, relational databases, and other sources.
Business analysts, data engineers, data scientists, and decision
makers access the data through business intelligence (BI) tools.
DATA WAREHOUSE
Benefits of a data warehouse:
Informed decision making
Consolidated data from many sources
Historical data analysis
Data quality, consistency, and accuracy
Separation of analytics processing from transactional databases, which
improves the performance of both systems
ETL - Extract, Transform and Load
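The ETL idea can be sketched in a few lines of Python; the file names, columns, and transformation here are assumptions made up for illustration:

import csv

# Extract: read raw rows from an assumed source file
with open("sales.csv", newline="") as f:
    rows = list(csv.DictReader(f))

# Transform: normalize region names, cast amounts to float, drop bad rows
clean = [
    {"region": r["region"].strip().title(), "amount": float(r["amount"])}
    for r in rows
    if r.get("amount")
]

# Load: write the cleaned rows to an assumed warehouse staging file
with open("warehouse_sales.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["region", "amount"])
    writer.writeheader()
    writer.writerows(clean)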
DATA VISUALIZATION
In today's technical world, a large amount of data is generated day by
day, and analyzing this data for trends and patterns can become
difficult when it is in its raw format.
Data visualization provides a good, organized pictorial representation
of the data, which makes it easier to understand, observe, and analyze.
It also helps communicate the information correctly to the intended
target audience.
Eg: the scores of a batsman in cricket
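As a sketch of the cricket example, the made-up scores below are plotted with matplotlib:

import matplotlib.pyplot as plt

# Made-up runs scored by a batsman across ten innings
innings = list(range(1, 11))
runs = [34, 12, 78, 45, 0, 102, 56, 23, 67, 88]

plt.plot(innings, runs, marker="o")
plt.xlabel("Innings")
plt.ylabel("Runs scored")
plt.title("Batsman's scores across innings")
plt.show()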
DATA COLLECTION
Data represents information collected in the form of numbers and
text.
Data collection is generally done after the experiment or
observation.
Primary data and Secondary data are helpful in planning and
estimating.
Data collection is either qualitative or quantitative.
DATA COLLECTION
The common methods of data collection are:
Census data collection
Sample data collection
Experimental data collection
Observational data collection
CENSUS DATA COLLECTION
Census data collection is a method whereby data is collected from
every member of the population.
SAMPLE DATA COLLECTION
Sample data collection, which is commonly just referred to as sampling,
is a method which collects data from only a chosen portion of the
population.
Sampling is used commonly in everyday life.
For example, consider the research polls conducted before elections:
pollsters don't ask all the people in a given state who they'll vote
for, but choose a small sample and assume that these people represent
how the entire population of the state is likely to vote.
History has shown that these polls are almost always close to accurate,
and as such sampling is a very powerful tool in statistics.
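A minimal sketch of drawing a simple random sample in Python; the population of voter IDs is made up for illustration:

import random

# Made-up population of 10,000 voter IDs
population = list(range(10_000))

# Draw a simple random sample of 500 voters without replacement
sample = random.sample(population, k=500)

print(len(sample), "voters sampled, e.g.:", sample[:5])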
EXPERIMENTAL DATA COLLECTION
Experimental data collection involves performing an experiment and
then collecting the resulting data for further analysis. Experiments
involve tests, and the results of these tests are your data.
An example of experimental data collection is rolling a die one hundred
times while recording the outcomes. Your data would be the results you
get in each roll. The experiment could involve rolling the die in different
ways and recording the results for each of those different ways.
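This experiment is easy to sketch in Python:

import random
from collections import Counter

# Roll a fair six-sided die one hundred times and record each outcome
outcomes = [random.randint(1, 6) for _ in range(100)]

# Tally how often each face appeared; these tallies are the collected data
print(Counter(outcomes))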
OBSERVATIONAL DATA COLLECTION
The observational data collection method involves not carrying out an
experiment but observing the population without influencing it at all.
Observational data collection is popular for studying trends and
behaviors of society where, for example, the lives of a group of
people are observed and data is collected about different aspects of
their lives. Analysis of data collected in such ways can be broadly
categorized into two categories, called descriptive and inferential
statistics.
STATISTICS
Statistics is a mathematical science including methods of
collecting, organizing and analyzing data in such a way that
meaningful conclusions can be drawn from them.
Data can be defined as groups of information that represent
the qualitative or quantitative attributes of a variable or set of
variables.
An example of data can be the ages of the students in a given
class. When you collect those ages, that becomes your data.
DESCRIPTIVE STATISTICS
Descriptive statistics deals with the processing of data without
attempting to draw any inferences from it. The data are
presented in the form of tables and graphs. The characteristics
of the data are described in simple terms. Events that are dealt
with include everyday happenings such as accidents, prices of
goods, business, incomes, epidemics, sports data, and population
data.
MEAN
The mean value is the average value of the whole data.
To calculate the mean, find the sum of all values and divide the sum
by the number of values.
MEDIAN
The median value is the value in the middle of the sorted data.
It is the value that partitions the sorted data into two equal halves.
MODE
The mode is the value that appears the most times in the dataset.
It is found by counting the most frequently repeated value in the
data set.
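All three measures can be computed with Python's built-in statistics module; the ages below are made-up illustration data:

import statistics

# Made-up ages of students in a class
ages = [18, 19, 19, 20, 21, 19, 22]

print("Mean:", statistics.mean(ages))      # sum of values / number of values
print("Median:", statistics.median(ages))  # middle value of the sorted data
print("Mode:", statistics.mode(ages))      # most frequently occurring value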
INFERENTIAL STATISTICS
The objective of making inferences from data is to make intelligent
assertions such as:
1. People who don't smoke live longer than people who smoke.
2. 80% of all vehicles in the USA are four-wheelers.
INFERENTIAL STATISTICS
In our regular life, we make decisions driven by data.
It is always a better idea to back our decisions with data. If we
don't have data to back a decision, it is easy to reach a wrong
conclusion.
Data is also a tangible way to defend yourself from the consequences
of a decision that was correct based on the information available at
the time it was made, but which later turned out wrong.
RANDOM VARIABLE
A random variable is a variable whose value cannot be determined
before an event happens.
Examples:
1. A person's blood type.
2. The number of leaves on a tree.
3. The number of times a user visits LinkedIn in a day.
PROBABILITY DISTRIBUTIONS
Probability distributions are functions that calculate the
probabilities of the outcomes of random variables.
Typical examples of random variables are coin tosses and dice rolls.
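As a sketch, the distribution of the sum of two dice can be estimated by simulation:

import random
from collections import Counter

# Simulate the sum of two dice many times and estimate its distribution
trials = 100_000
sums = [random.randint(1, 6) + random.randint(1, 6) for _ in range(trials)]

counts = Counter(sums)
for total in sorted(counts):
    print(total, round(counts[total] / trials, 3))  # estimated probability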
PROBABILITY DISTRIBUTIONS
Normally distributed data can be transformed into a standard normal
distribution.
Standardizing normally distributed data makes it easier to compare
different sets of data.
The standard normal distribution is used for:
1. Calculating confidence intervals
2. Hypothesis tests
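Standardization converts each value x into a z-score, z = (x - mean) / standard deviation. A minimal sketch with made-up exam marks:

import statistics

# Made-up exam marks, assumed roughly normally distributed
marks = [62, 70, 75, 80, 85, 90, 95]

mu = statistics.mean(marks)
sigma = statistics.stdev(marks)

# z = (x - mean) / standard deviation
z_scores = [(x - mu) / sigma for x in marks]
print([round(z, 2) for z in z_scores])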
DATA SCIENCE
Data Science is the area of study that involves extracting insights
from vast amounts of data using various scientific methods,
algorithms, and processes. It helps you discover hidden patterns in
raw data. The term Data Science has emerged because of the evolution
of mathematical statistics, data analysis, and big data.
Why Data Science?
Data Science can help you detect fraud using advanced machine
learning algorithms.
It helps you prevent significant monetary losses.
It allows you to build intelligent capabilities into machines.
You can perform sentiment analysis to gauge customer brand loyalty.
It enables you to make better and faster decisions.
It helps you recommend the right product to the right customer to
enhance your business.
APPLICATION
Internet Search: Google Search uses data science technology to return
a specific result within a fraction of a second.
Recommendation Systems: Recommendation systems such as "suggested
friends" on Facebook or "suggested videos" on YouTube are built with
the help of data science.
Image & Speech Recognition: Speech recognition systems like Siri,
Google Assistant, and Alexa run on data science techniques. Moreover,
Facebook recognizes your friends when you upload a photo with them,
with the help of data science.
APPLICATION
Gaming world: EA Sports, Sony, and Nintendo use data science
technology to enhance your gaming experience. Games are now developed
using machine learning techniques, and they can update themselves as
you move to higher levels.
Online Price Comparison: PriceRunner, Junglee, and Shopzilla work on
data science mechanisms, where data is fetched from the relevant
websites using APIs.
CHALLENGES
A high variety of information and data is required for accurate
analysis.
The available data science talent pool is inadequate.
Management does not provide financial support for a data science team.
Unavailability of, or difficult access to, data.
Business decision-makers do not effectively use data science results.
Explaining data science to others is difficult.
Privacy issues.
Lack of domain experts.
A very small organization cannot sustain a data science team.
DATA PREPROCESSING
Data preprocessing is the preliminary processing of data in order to
prepare it for the primary processing or for further analysis.
The term can be applied to any first or preparatory processing stage
when several steps are required to prepare data for the user.
For example, extracting data from a larger set, filtering it for
various reasons, and combining sets of data could all be preprocessing
steps.
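A minimal preprocessing sketch using pandas; the file names and columns are assumptions made up for illustration:

import pandas as pd

# Extract: load an assumed raw dataset and keep only the columns of interest
raw = pd.read_csv("survey.csv")
subset = raw[["age", "income"]]

# Filter: drop rows with missing values and out-of-range ages
clean = subset.dropna()
clean = clean[(clean["age"] >= 0) & (clean["age"] <= 120)]

# Combine: append a second assumed dataset with the same columns
extra = pd.read_csv("survey_extra.csv")[["age", "income"]]
combined = pd.concat([clean, extra], ignore_index=True)

print(combined.describe())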
STEPS IN DATA PREPROCESSING

Data transformation: Here, data scientists think about how different aspects of the
data need to be organized to make the most sense for the goal. This could include
things like structuring unstructured data, combining salient variables when it makes
sense or identifying important ranges to focus on.
Data enrichment: In this step, data scientists apply the various feature engineering
libraries to the data to effect the desired transformations. The result should be a data
set organized to achieve the optimal balance between the training time for a new
model and the required compute.
Data validation: At this stage, the data is split into two sets. The first set is used
to train a machine learning or deep learning model. The second set is the testing data
that is used to gauge the accuracy and robustness of the resulting model. This step
helps identify any problems in the hypotheses used in the cleaning and feature
engineering of the data.
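A minimal sketch of that split using scikit-learn's train_test_split; the feature matrix and labels are made up for illustration:

from sklearn.model_selection import train_test_split

# Made-up feature matrix X and labels y
X = [[i, i * 2] for i in range(10)]
y = [0, 1] * 5

# Hold out 30% of the data for testing the trained model
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

print(len(X_train), "training rows,", len(X_test), "test rows")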
