CENG313 Introduction To Data Science: Lecture 3-4: Data Types and Datasets

The document provides an overview of data types and datasets in data science, explaining the concepts of instances, features, and data representation. It categorizes data types into simple (numeric, boolean, strings) and compound types (dates, lists, dictionaries), and discusses how data is stored in various formats. Additionally, it highlights common issues in data management, such as missing values and data representation challenges, while also suggesting resources for finding datasets.

Uploaded by

Mert Akgüç

CENG313 Introduction to Data Science
Lecture 3-4: Data Types and Datasets

Instructor: Assist. Prof. Ceren Güzel Turhan

Datasets = instances + features
• Datasets consist of instances (also known as examples, objects, or observations)
  o e.g., in a university database: students, professors, courses, grades, …
  o e.g., in a library database: books, users, loans, publishers, …
  o e.g., in a movie database: movies, actors, directors, …

• Instances are described through features (also known as attributes, variables, or dimensions)
  o e.g., a course is described in terms of a title, description, lecturer, teaching frequency, etc.
• The feedback feature (for supervised learning) is called the class attribute
Data
• Data can often be represented or abstracted as an n×d data matrix D
  o n rows correspond to instances
  o d columns correspond to features (the feature set F)

• The number of instances n is referred to as the size or cardinality of the dataset, n = |D|
• The number of features d is referred to as the dimensionality of the dataset
• Subset of the data: D’ ⊆ D
• Subspace: F’ ⊆ F
  o Subspace projection: the data restricted to the features in F’
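As a minimal sketch of the notation above, a small data matrix can be written as a list of rows in plain Python. The feature names and measurements are hypothetical, iris-like values chosen only for illustration; in practice NumPy or pandas would usually hold such a matrix.

```python
# A tiny n x d data matrix D: each row is an instance, each column a feature.
# Feature names and values are hypothetical, used only for illustration.
features = ["sepal_length", "sepal_width", "petal_length", "petal_width"]  # feature set F
D = [
    [5.1, 3.5, 1.4, 0.2],
    [4.9, 3.0, 1.4, 0.2],
    [6.3, 3.3, 6.0, 2.5],
]

n = len(D)     # size (cardinality) of the dataset, n = |D|
d = len(D[0])  # dimensionality of the dataset

# Subset of the data (D' is a subset of D): keep only some instances (rows).
D_subset = [row for row in D if row[2] > 2.0]

# Subspace projection onto F' (a subset of F): keep only some features (columns).
F_sub = ["petal_length", "petal_width"]
cols = [features.index(name) for name in F_sub]
D_projected = [[row[j] for j in cols] for row in D]

print(n, d)
print(D_subset)
print(D_projected)
```

Selecting rows gives a subset of instances, while selecting columns gives a projection onto a feature subspace; the two operations are independent and can be combined.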
An example from the iris dataset
What type of data?
What kind of values are in your data (data types)?

Simple or atomic:
• Numeric: integers, floats
• Boolean: binary or true/false values
• Strings: sequence of symbols
What type of data?
What kind of values are in your data (data types)?
Compound, composed of a bunch of atomic types:
• Date and time: compound value with a specific structure
• Lists: a list is a sequence of values
• Dictionaries: a dictionary is a collection of key-value pairs, a pair of values x : y where x is usually a string called the key, representing the “name” of the entry, and y is a value of any type.
Example: Student record
• First: Kevin
• Last: Rader
• Classes: [CS-109A, STAT139]
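The simple and compound types above map directly onto Python values. The sketch below reuses the student record from the slide; the age, GPA, enrollment flag, and start date are hypothetical values added only to show each type.

```python
from datetime import datetime

# Simple (atomic) types; the values are hypothetical.
age = 21          # numeric: integer
gpa = 3.7         # numeric: float
enrolled = True   # boolean

# Compound types.
start = datetime(2024, 9, 16, 9, 30)  # date and time (hypothetical value)
classes = ["CS-109A", "STAT139"]      # list: a sequence of values
student = {                           # dictionary: key-value pairs
    "First": "Kevin",                 # string value
    "Last": "Rader",
    "Classes": classes,               # a value can itself be compound
}

print(type(age).__name__, type(gpa).__name__, type(enrolled).__name__)
print(start.year, start.strftime("%H:%M"))
print(student["Classes"][0])
```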
How is the data stored?
How is your data represented and stored (data format)?
• Tabular Data: a dataset that is a two-dimensional table, where each row typically represents a single data record, and each column represents one type of measurement (csv, dat, xlsx, etc.).
• Structured Data: each data record is presented in the form of a [possibly complex and multi-tiered] dictionary (json, xml, etc.)
• Semistructured Data: not all records are represented by the same set of keys, or some data records are not represented using the key-value pair structure.
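The three formats can be contrasted with Python's standard `csv` and `json` modules. This is only a sketch: the CSV text, the JSON text, and the semistructured records are hypothetical, with column names borrowed from the attributes mentioned in these slides.

```python
import csv
import io
import json

# Tabular data: every record has the same columns (hypothetical CSV content).
csv_text = "seq_id,status,duration\n1,Closed,540\n2,Closed,1200\n"
rows = list(csv.DictReader(io.StringIO(csv_text)))

# Structured data: each record is a (possibly nested) dictionary.
json_text = '[{"First": "Kevin", "Last": "Rader", "Classes": ["CS-109A", "STAT139"]}]'
records = json.loads(json_text)

# Semistructured data: records do not all share the same set of keys.
semi = [{"First": "Kevin", "Classes": ["CS-109A"]}, {"First": "Alice", "Email": None}]
all_keys = sorted({key for rec in semi for key in rec})

print(rows[0]["duration"])    # note: csv reads every value as a string
print(records[0]["Classes"])
print(all_keys)
```

Note that `csv.DictReader` returns every value as a string, so numeric columns need an explicit conversion before analysis.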
How is the data stored?
In tabular data, we expect each record or observation to represent a set of measurements of a single object or event.
How is the data stored?

Each type of measurement is called a variable or an attribute of the data (e.g. seq_id, status and duration are variables or attributes). Variables are often called features, and the number of attributes is called the dimension.
We expect each table to contain a set of records or observations of the same kind of object or event (e.g. our table above contains observations of rides/checkouts).
How is the data stored?
We’ll see later that it’s important to distinguish between classes of variables or attributes based on the type of values they can take on.
• Quantitative variable: is numerical and can be either:
  • discrete - a finite number of values are possible in any bounded interval. For example: “Number of siblings” is a discrete variable
  • continuous - an infinite number of values are possible in any bounded interval. For example: “Height” is a continuous variable
• Categorical variable: no inherent order among the values. For example: “What kind of pet you have” is a categorical variable
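The class of a variable determines which summaries make sense. In the sketch below, with hypothetical survey answers, the quantitative variables are averaged while the categorical one is only counted.

```python
from collections import Counter
from statistics import mean

# Hypothetical survey answers, one list per variable.
siblings = [0, 2, 1, 3, 1]                   # quantitative, discrete
height_cm = [162.5, 178.0, 170.2, 181.3]     # quantitative, continuous
pet = ["cat", "dog", "cat", "none", "cat"]   # categorical

# Arithmetic summaries make sense for quantitative variables.
mean_siblings = mean(siblings)
mean_height = round(mean(height_cm), 2)

# For a categorical variable we count categories instead of averaging.
most_common_pet = Counter(pet).most_common(1)

print(mean_siblings, mean_height, most_common_pet)
```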
Basic feature types
• Categorical variables (qualitative)
  o Binary (dichotomous) variables
  o Nominal variables
  o Ordinal variables

• Numeric variables (quantitative)
  o Interval-scale variables
  o Ratio-scale variables
Binary/Dichotomous variables
• The attribute can take two values, {0,1} or {true,false}
  o usually, 0 means absence and 1 means presence
  o e.g., smoker variable: 1 → smoker, 0 → non-smoker
• Are both values equally important?
  o Symmetric binary: both outcomes are equally important
    - e.g., gender (male, female)
  o Asymmetric binary: outcomes are not equally important
    - e.g., medical tests (positive vs. negative)
    - Convention: assign 1 to the most important outcome (e.g., HIV positive)
What are the binary variables in the example below?
Categorical: Nominal variables
• The attribute can take values within a set of M categories/states (binary variables are a special case)
  o No ordering in the categories/states.
  o Only distinctness relationships apply, i.e.,
    - equal (=) and
    - different (≠)

  o Examples:
    - Colors = {brown, green, blue, …, gray}
    - Occupation = {engineer, doctor, teacher, …, driver}
What are the categorical variables in the example below?
Categorical: Ordinal variables
• Similar to nominal variables, but the M states are ordered/ranked in a meaningful way.
  o Order relationships can be applied, i.e., >, ≥, <, ≤
  o However, the difference and the ratio between values have no meaning.
    - E.g., is 5* − 3* the same as 3* − 1*? Is 4* two times better than 2*?

  o Examples:
    - School grades: {A, B, C, D, F}, {AA, BA, BB, CB, CC, DC, DD, FD, FF}, {A+, A, B+, B, C+, C, D, F}

    - Movie ratings: {hate, dislike, indifferent, like, love}, {*, **, ***, ****, *****}, {1, 2, 3, 4, 5}

    - Medals = {bronze, silver, gold}
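One common way to work with an ordinal variable in code is to map each category to its rank and use the ranks only for comparison and sorting, never for arithmetic. The sketch below uses the medals example; the list of results is hypothetical.

```python
# Ordinal values have an order but no meaningful distances, so each category
# is mapped to a rank that is used only for comparisons and sorting.
MEDAL_ORDER = ["bronze", "silver", "gold"]
rank = {medal: i for i, medal in enumerate(MEDAL_ORDER)}

results = ["silver", "gold", "bronze", "gold"]  # hypothetical data

best = max(results, key=rank.get)               # highest-ranked category
sorted_results = sorted(results, key=rank.get)  # sort by rank, not alphabetically
is_better = rank["gold"] > rank["silver"]       # order comparison is valid

print(best, sorted_results, is_better)
```

Subtracting or dividing the ranks would give numbers, but, as noted above, those numbers would have no meaning for an ordinal variable.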

What are the categorical variables in the example below?
Numeric: Interval-scale variables
• Differences between values are meaningful
  o The difference between a temperature of 90° and 100° is the same as the difference between 40° and 50°.

• Examples:
  o Calendar dates, temperature in Fahrenheit or Celsius, ...

• Ratios still have no meaning
  o A temperature of 2 °C is not twice as warm as a temperature of 1 °C; the two are almost the same.
  o The issue is that the 0° point of the Celsius scale is, in a physical sense, arbitrary, and therefore the ratio of two Celsius temperatures is not physically meaningful.
Numeric: Ratio-scale variables
• Both differences and ratios have a meaning
  o E.g., a 100 kg person is twice as heavy as a 50 kg person.
  o E.g., a 50-year-old person is twice as old as a 25-year-old person.

• Meaningful (unique and non-arbitrary) zero value

• Examples:
  o age, weight, length, number of sales
  o temperature in Kelvin
    - When measured on the Kelvin scale, a temperature of 2 K is, in a physically meaningful way, twice that of 1 K.
    - The zero value is absolute zero, which represents the complete absence of molecular motion
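The interval-versus-ratio distinction can be checked numerically with the temperature example. This is a small sketch; `celsius_to_kelvin` is a helper name introduced here for illustration.

```python
def celsius_to_kelvin(t_celsius):
    """Convert a Celsius temperature to Kelvin."""
    return t_celsius + 273.15

t1_c, t2_c = 1.0, 2.0

# Differences are meaningful on both scales and agree with each other.
diff_c = t2_c - t1_c
diff_k = celsius_to_kelvin(t2_c) - celsius_to_kelvin(t1_c)

# The Celsius ratio depends on the arbitrary zero point; the Kelvin ratio does not.
ratio_c = t2_c / t1_c
ratio_k = celsius_to_kelvin(t2_c) / celsius_to_kelvin(t1_c)

print(diff_c, round(diff_k, 6))
print(ratio_c, round(ratio_k, 4))
```

The Celsius ratio of 2.0 suggests "twice as warm", while the Kelvin ratio is only slightly above 1, which matches the physical intuition that the two temperatures are nearly the same.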

What are the ratio-scale variables in the example below?

Nominal, ordinal, interval-scale, ratio-scale variables: overview of operations

Source: https://www.sagepub.com/sites/default/files/upm-binaries/19708_6.pdf
Common Issues
A few good generic questions to ask are as follows:
● How big is the dataset?
● Is this the entire dataset?
● Is this data representative enough? For example, maybe data was only collected for a subset of users.
● Are there likely to be gross outliers or extraordinary sources of noise? For example, 99% of the traffic from a web server might be a single denial-of-service attack.
● Might there be artificial data inserted into the dataset? This happens a lot in industrial settings.
● Are there any fields that are unique identifiers? These are the fields you might use for joining between datasets, etc.
● Are the supposedly unique identifiers actually unique? What does it mean if they aren’t?
● If there are two datasets A and B that need to be joined, what does it mean if something in A doesn’t match anything in B?
● When data entries are blank, where does that come from?
● How common are blank entries?
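Several of these questions can be answered with a few lines of code before any analysis starts. The sketch below uses two small hypothetical datasets, A and B, joined on a hypothetical `user_id` field.

```python
from collections import Counter

# Two hypothetical datasets that should be joined on "user_id".
A = [
    {"user_id": 1, "city": "Ankara"},
    {"user_id": 2, "city": ""},       # blank entry
    {"user_id": 2, "city": "Izmir"},  # duplicated "unique" identifier
    {"user_id": 4, "city": None},     # missing entry
]
B = [{"user_id": 1, "orders": 3}, {"user_id": 2, "orders": 5}]

# How big is the dataset?
size = len(A)

# Are the supposedly unique identifiers actually unique?
id_counts = Counter(rec["user_id"] for rec in A)
duplicate_ids = [uid for uid, count in id_counts.items() if count > 1]

# How common are blank entries?
blank = sum(1 for rec in A if rec["city"] in ("", None))
blank_fraction = blank / size

# Which identifiers in A do not match anything in B?
ids_in_B = {rec["user_id"] for rec in B}
unmatched = sorted({rec["user_id"] for rec in A} - ids_in_B)

print(size, duplicate_ids, blank_fraction, unmatched)
```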
Common Issues

Common issues with data:

• Missing values: how do we fill in?
• Wrong values: how can we detect and correct?
• Messy format
• Not usable: the data cannot answer the question posed
Common Issues
• The following is a table accounting for the number of produce deliveries over a weekend.
• What are the variables in this dataset? What object or event are we measuring?

What’s the issue? How do we fix it?

Common Issues
We’re measuring individual deliveries; the variables are Time, Day, and Number of Produce.

Problem: each column header represents a single value rather than a variable. Row headers are “hiding” the Day variable. The values of the variable “Number of Produce” are not recorded in a single column.
Common Issues – Now it is better!
We need to reorganize the information to make explicit the event we’re observing and the variables associated with this event.
Common Issues – Now it is better!
Common causes of messy data are:
• Column headers are values, not variable names
• Variables are stored in both rows and columns
• Multiple variables are stored in one column/entry
• Multiple types of experimental units are stored in the same table
In general, we want each file to correspond to a dataset, each column to represent a single variable, and each row to represent a single observation.
We want to tabularize the data. This makes Python happy.
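A minimal sketch of this reorganization in plain Python follows. The delivery counts are hypothetical (the slide's table is not reproduced here); the variable names Day, Time, and Number of Produce are the ones identified above.

```python
# Hypothetical "messy" table: the row headers hide the Day variable and the
# column headers (times of day) are values rather than variable names.
messy = {
    "Saturday": {"Morning": 15, "Afternoon": 9, "Evening": 36},
    "Sunday":   {"Morning": 8,  "Afternoon": 7, "Evening": 25},
}

# Tidy form: one row per observation, one column per variable.
tidy = [
    {"Day": day, "Time": time, "Number of Produce": count}
    for day, by_time in messy.items()
    for time, count in by_time.items()
]

for row in tidy:
    print(row)
```

In the tidy version every value of "Number of Produce" sits in a single column, so filtering by Day or Time becomes an ordinary row selection.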
Where to find Datasets?
https://www.kaggle.com/c/titanic/data
Where to find Datasets?
UCI Machine Learning Repository - 559 data sets https://archive.ics.uci.edu/ml/index.php
Where else to find Datasets?
Academic Torrents
r/datasets
datahub.io
aws.amazon.com/datasets
Kaggle Data Sources
Kaggle Datasets
World Bank Data
NYC Taxi data
Open Data Philly: Connecting people with data for Philadelphia
National Climatic Data Center - NOAA
ClimateData.us
UNICEF Data
undata
NASA SocioEconomic Data and Applications Center - SEDAC
San Francisco Government Open Data
The Internet Movie Database (IMDb)
Crowdsourced and curated data about all aspects of the motion picture industry,
at www.imdb.com
• data on over 3.3 million movies and TV programs
• For each film, IMDb includes:
• its title, running time, genres, date of release, and a full list of cast and crew.
• financial data about each production, including the budget for making the film
and how well it did at the box office.
• ratings for each film from viewers and critics (scores on a zero to ten stars
scale)
• written reviews
• links between films: for example, identifying which other films have been
watched most often by viewers of a film

Ref: The Data Science Design Manual by Steven Skiena

The Internet Movie Database (IMDb)
What kind of questions can you answer with this data?
[Diagram: Problem Solving, linking Questions and Data]
• Questions: Identify
• Data: Collecting, Curating, Cleaning
• Visualize, Analyze, Model (Machine Learning)
• Does the past represent the future?
• What do I want to model?
• How will the model be used?
• What data do I need?
• What data do I have?
• How hard is it to get data?
The Internet Movie Database (IMDb)
What kind of questions can you answer with this data?

• Which actors appeared in the most films? Appeared in the lowest-rated films? Had the longest career or the shortest lifespan?

• What was the highest-rated film each year, or the best in each genre? Which movies lost the most money, had the highest-powered casts, or got the least favorable reviews?

Ref: The Data Science Design Manual by Steven Skiena
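Questions such as "the highest-rated film each year, or the best in each genre" reduce to a group-and-maximize step. The sketch below shows one way to do it; the film records, titles, and ratings are made up for illustration and are not IMDb data, and `best_by` is a helper name introduced here.

```python
# Hypothetical film records; titles and ratings are invented for illustration.
films = [
    {"title": "Film A", "year": 2001, "genre": "Drama",  "rating": 7.9},
    {"title": "Film B", "year": 2001, "genre": "Comedy", "rating": 8.4},
    {"title": "Film C", "year": 2002, "genre": "Drama",  "rating": 6.5},
    {"title": "Film D", "year": 2002, "genre": "Drama",  "rating": 8.1},
]

def best_by(records, key):
    """Return the highest-rated record for each distinct value of `key`."""
    best = {}
    for rec in records:
        group = rec[key]
        if group not in best or rec["rating"] > best[group]["rating"]:
            best[group] = rec
    return best

best_per_year = {year: rec["title"] for year, rec in best_by(films, "year").items()}
best_per_genre = {genre: rec["title"] for genre, rec in best_by(films, "genre").items()}

print(best_per_year)
print(best_per_genre)
```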

The Internet Movie Database (IMDb)
What kind of questions can you answer with this data?

• How well does movie gross correlate with viewer ratings or awards?

• How do Hollywood movies compare to Korean movies, in terms of ratings, budget, and gross? Are foreign movies better received than American films, and how does this differ between U.S. and non-U.S. reviewers?

Ref: The Data Science Design Manual by Steven Skiena

The Internet Movie Database (IMDb)
What kind of questions can you answer with this data?

• What is the age distribution of actors and actresses in films? How much younger is the actress playing the wife, on average, than the actor playing the husband? Has this disparity been increasing or decreasing with time?

• In which movies did Tom Hanks play alongside Keanu Reeves?

Ref: The Data Science Design Manual by Steven Skiena

Why Python?
• Python is the most popular language for data scientists. R is its only major competitor, at least when it comes to free tools.
• Graphical packages such as Excel or Tableau are available, but they are not as flexible as code!
• High-level scripting language
• A jack-of-all-trades
  • If you only need to worry about statistics, or numerical computation, or web parsing, then there are better options out there. But if you need to do all of these things within a single project, then Python is your best option. Since data science is so inherently multidisciplinary, this makes it a perfect fit.
• Open source frameworks and libraries!
• Scientific Computing with Python (NumPy, SciPy, scikit-learn)
Why Python?
Other Programming Languages for Data Science:

• R
  • designed by and for statisticians, and it is natively integrated with graphics capabilities and extensive statistical functions
• Matlab and Octave
  • physics or mechanical/electrical engineering
  • Matlab is not open source (GNU Octave is a free, open-source alternative)
• SAS (Statistical Analysis System)
  • business statistics applications
  • Sociology
• Scala
  • up-and-coming language that shows a lot of promise
  • doesn’t have the library support for analytics and visualizations
Why Python?
Some tools besides programming languages that might be handy in general:

• Excel: Microsoft products often get a bad rap in the data science world, and it is completely undeserved. For simple data analysis, Excel is probably the best piece of software ever made.
• Tableau: This is a tool for visualizing the data in relational databases. It’s pretty limited in its functionality, but when it works, the graphics are beautiful.
• Weka: This is a tool for applying pre-canned machine learning algorithms to datasets that are already well formatted and contain the relevant features. An advantage of Weka is that it’s really just a thin GUI wrapper around some Java libraries, so it’s easy to use the same models in your exploratory analysis and later production code (assuming that you work in Java).
Why Python for data science?
Guido van Rossum, Python’s “Benevolent Dictator for Life”, and the Zen of Python:

Whitespace instead of symbols
• tabs, indentation and line breaks matter
• code remains uncluttered

Variable types determined automatically
• no need to declare the type of your variables before assigning values

Intuitive grammar
• PEP8: style guide
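The two language properties named above, significant whitespace and automatically determined types, can be seen in a few lines. This is only an illustrative sketch; `describe` is a function name introduced here.

```python
# Indentation, not braces, marks the body of the function and of the loop.
def describe(values):
    descriptions = []
    for value in values:
        descriptions.append(f"{value!r} is a {type(value).__name__}")
    return descriptions

# No type declarations: the same name can be bound to values of different types.
x = 42
x_type_before = type(x).__name__
x = "forty-two"
x_type_after = type(x).__name__

print(x_type_before, x_type_after)
print(describe([42, 3.14, "text", True]))
```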

Three advantages:
1. Python is popular
• Large user community
• Well-maintained libraries
• Online guidance (StackOverflow)
2. Easy to learn and share
WHY PEOPLE LIKE IT:
• Code is intuitive and expressive (compare C++)
• Suited to large quantities of data
• Transparent, reproducible research through Jupyter Notebooks
3. Thriving ecosystem of tools
Data science workflow and example libraries:
• Get data: BeautifulSoup, MySQL client, API clients (Twitter, ESRI, OSMnx…)
• Clean data: pandas, GeoPandas, Rasterio
• Modeling and analysis: NumPy, SciPy, statsmodels, scikit-learn
• Evaluate and present: Jupyter Notebook, Matplotlib, Flask
Python Development Environments and Code Editors
• Python IDEs (Integrated Development Environments): specialized software applications designed for Python development.
  • PyCharm, Visual Studio Code with Python extensions, Spyder
• Jupyter Notebook
• Text Editors, Command Line/Terminal: text editors like Notepad (on Windows), TextEdit (on macOS), or any code-focused text editor (e.g., Sublime Text, Atom, VS Code without specific Python extensions).
• Online Python Editors/IDEs: some online platforms run Python directly in the web browser.
  • Repl.it, PythonAnywhere, Google Colab (for Jupyter notebooks), Codecademy's Python coding environment, Kaggle kernels
• Python Shell
A Quick Python Tour
>>> print("Hello world!")
Hello world!
>>> number = 42
>>> print(f"The number is: {number}")
The number is: 42
>>> print("The number is: {}".format(number))
The number is: 42
>>> name = "Alice"
>>> age = 30
>>> print("Name:", name)
Name: Alice
>>> print("Age:", age)
Age: 30
>>> pi = 3.14159265
>>> print(f"The value of pi is: {pi:.2f}")  # Displays pi with 2 decimal places
The value of pi is: 3.14
>>> x = 42
>>> print(f"Value: {x:05}")  # Pads with leading zeros to a total width of 5
Value: 00042
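A Quick Python Tour
Basic Math:
print(3 + 5)   # addition: 8
print(3 - 5)   # subtraction: -2
print(3 * 5)   # multiplication: 15
print(3 / 5)   # true division: 0.6
print(3 // 5)  # floor division: 0
print(3 ** 5)  # exponentiation: 243
print(3 % 5)   # remainder (modulo): 3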
A Quick Python Tour
Control Flow:

hour = 16
if hour < 12:
    print('Good morning!')
elif 12 <= hour < 20:
    print('Good afternoon!')
else:
    print('Good evening!')
A Quick Python Tour
Loops:

i = 2
while i < 20:  # prints 2, 3, ..., 19
    print(i)
    i += 1
for i in range(2, 10, 2):  # prints 2, 4, 6, 8
    print(i)
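A Quick Python Tour
Data Structures:
A Quick Python Tour
Lists:
countries = ['Portugal', 'Spain', 'United Kingdom']
numbers = list(range(10))  # [0, 1, ..., 9]
len(countries)   # 3
countries[0]     # 'Portugal'
countries[1]     # 'Spain'
countries[2]     # 'United Kingdom'
numbers[-1]      # 9 (last element)
numbers[-2]      # 8 (second to last)
numbers[3:5]     # [3, 4]
numbers[-2:]     # [8, 9]
numbers[:-2]     # [0, 1, 2, 3, 4, 5, 6, 7]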
A Quick Python Tour
Lists:
things = "Apples Oranges Crows Telephone Light Sugar"
stuff = things.split(" ")  # assumed missing step: build the list that the loop extends

more_stuff = ["Day", "Night", "Song", "Frisbee", "Corn",
              "Banana", "Girl", "Boy"]

while len(stuff) != 10:
    next_one = more_stuff.pop()  # remove and return the last element
    stuff.append(next_one)
    print("There are %d items now." % len(stuff))

print(stuff[1])   # 'Oranges'
print(stuff[-1])  # 'Corn'
A Quick Python Tour
Dictionaries:
thisdict = {
    "brand": "Ford",
    "model": "Mustang",
    "year": 1964
}
print(thisdict)

{'brand': 'Ford', 'model': 'Mustang', 'year': 1964}

A Quick Python Tour
Dictionaries:
thisdict = {
    "brand": "Ford",
    "model": "Mustang",
    "year": 1964
}

x = thisdict["model"]
print(x)

Mustang
A Quick Python Tour
Dictionaries:
thisdict = {
    "brand": "Ford",
    "model": "Mustang",
    "year": 1964
}

x = thisdict.get("model")
print(x)

Mustang
A Quick Python Tour
Dictionaries:
thisdict = {
    "brand": "Ford",
    "model": "Mustang",
    "year": 1964
}

thisdict["year"] = 2018

print(thisdict)

{'brand': 'Ford', 'model': 'Mustang', 'year': 2018}
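The two lookup styles shown on the previous slides behave differently when a key is missing, which the sketch below illustrates. The `color` key and its values are hypothetical additions to the slide's dictionary.

```python
thisdict = {"brand": "Ford", "model": "Mustang", "year": 1964}

# Square brackets raise KeyError when the key is missing ...
try:
    color = thisdict["color"]
except KeyError:
    color = None

# ... while get() returns None, or a default of our choosing.
color_default = thisdict.get("color", "unknown")

# Assigning to a new key adds an entry; items() yields key-value pairs.
thisdict["color"] = "red"
pairs = [f"{key}={value}" for key, value in thisdict.items()]

print(color, color_default)
print(pairs)
```

Using `get()` with a default is a convenient fit for semistructured records, where not every record carries every key.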
A Quick Python Tour
Functions:

def greet(hour):
    if hour < 12:
        print('Good morning!')
    elif 12 <= hour < 20:
        print('Good afternoon!')
    else:
        print('Good evening!')
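A Quick Python Tour
Importing libraries:

import math                 # use as math.sqrt(...)
from math import sqrt       # import a single name
from math import sqrt, pow  # import several names
from math import *          # import every public name (best avoided)
import math as matematik    # import under an alias
import numpy as np          # conventional alias for NumPy (must be installed)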
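The import forms above differ only in the name through which the same function is reached. A short sketch using the standard `math` module follows; the alias `m` is an arbitrary choice, just as `matematik` is on the slide.

```python
import math
from math import sqrt
import math as m  # any alias works

a = math.sqrt(16)  # access through the module name
b = sqrt(16)       # name imported directly into the current namespace
c = m.sqrt(16)     # access through the alias

# All three names refer to the very same function object.
same_function = math.sqrt is sqrt is m.sqrt

print(a, b, c, same_function)
```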
TODO:

• Learn hands-on Python
• No need for installations, just start! (Shell runs in the browser)

• https://www.codecademy.com/learn/learn-python * * * * *

• https://www.w3schools.com/python/default.asp
• https://www.python.org/about/gettingstarted/
• https://www.learnpython.org/

• Python scientific computing libraries and Jupyter Notebook

Tutorial (document)
Jupyter Notebooks
• https://drive.google.com/file/d/1fXNZDSTvU1ZIOj3sxr86zTc9qVyTJRR3/view?usp=share_link
• https://drive.google.com/file/d/19SOtD0YFhAG0nv3WIvNqy5TIxHIS6HTa/view?usp=share_link
• https://drive.google.com/file/d/1YjNcs0IbAH3O7bIknAbyaEdYOJEeieiv/view?usp=share_link
• https://drive.google.com/file/d/1-VJn5y-Mv5uYKvj3lYyzh2E4R08WWx5b/view?usp=share_link
References:
• Pavlos Protopapas, Kevin Rader and Chris Tanner (data)
• Hendrik Heuer (python tour)
