CENG313 Introduction To Data Science: Lecture 3-4: Data Types and Datasets

The document provides an overview of data types and datasets in data science, explaining the concepts of instances, features, and data representation. It categorizes data types into simple (numeric, boolean, strings) and compound types (dates, lists, dictionaries), and discusses how data is stored in various formats. Additionally, it highlights common issues in data management, such as missing values and data representation challenges, while also suggesting resources for finding datasets.

Uploaded by

Mert Akgüç

CENG313 Introduction to Data Science
Lecture 3-4: Data Types and Datasets

Instructor: Assist. Prof. Ceren Güzel Turhan


Datasets = instances + features
• A dataset consists of instances (also known as examples, objects, or observations)
  ◦ e.g., in a university database: students, professors, courses, grades, …
  ◦ e.g., in a library database: books, users, loans, publishers, …
  ◦ e.g., in a movie database: movies, actors, directors, …

• Instances are described through features (also known as attributes, variables, or dimensions)
  ◦ e.g., a course is described in terms of a title, description, lecturer, teaching frequency, etc.
• The feedback feature (for supervised learning) is called the class attribute
Data
• Data can often be represented or abstracted as an n×d data matrix D
  ◦ n rows correspond to instances
  ◦ d columns correspond to features, which together form the feature set F

• The number of instances n is referred to as the size or cardinality of the dataset, n = |D|
• The number of features d is referred to as the dimensionality of the dataset
• Subset of the data: D' ⊆ D
• Subspace: F' ⊆ F
  ◦ Subspace projection: each instance is restricted to the features in F'
An example from the iris dataset
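A minimal sketch of the data matrix idea in plain Python. The four rows are a small illustrative sample in the style of the iris data, and the variable names (D, D_subset, D_projected) are chosen here for illustration only.

```python
# Illustrative mini-sample: 4 instances, 4 numeric features, 1 class attribute.
feature_names = ["sepal_length", "sepal_width", "petal_length", "petal_width"]
D = [
    [5.1, 3.5, 1.4, 0.2],
    [4.9, 3.0, 1.4, 0.2],
    [7.0, 3.2, 4.7, 1.4],
    [6.3, 3.3, 6.0, 2.5],
]
classes = ["setosa", "setosa", "versicolor", "virginica"]  # class attribute

n = len(D)     # number of instances (size / cardinality)
d = len(D[0])  # number of features (dimensionality)

# Subset of the data, D' ⊆ D: keep only some rows.
D_subset = [row for row, c in zip(D, classes) if c == "setosa"]

# Subspace projection onto F' ⊆ F: keep only some columns.
F_sub = ["petal_length", "petal_width"]
cols = [feature_names.index(f) for f in F_sub]
D_projected = [[row[j] for j in cols] for row in D]

print(n, d)
print(D_projected[0])
```

Selecting rows gives a subset of the instances, while selecting columns gives a projection onto a subspace of the features; the two operations are independent of each other.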
What type of data?
What kind of values are in your data (data types)?

Simple or atomic:
• Numeric: integers, floats
• Boolean: binary or true/false values
• Strings: sequences of symbols
What type of data?
What kind of values are in your data (data types)?
Compound, composed of several atomic types:
• Date and time: compound value with a specific structure
• Lists: a list is a sequence of values
• Dictionaries: a dictionary is a collection of key-value pairs x : y, where x is usually a string called the key, representing the “name” of the entry, and y is a value of any type.
Example: Student record
• First: Kevin
• Last: Rader
• Classes: [CS-109A, STAT139]
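The student record above could be written in Python roughly as follows. This is a sketch only: the key names and the "enrolled" date are hypothetical additions, included to show a date value next to strings and a list.

```python
from datetime import date

# The student record as a dictionary: keys are strings, values can be any type.
student = {
    "first": "Kevin",                   # string (atomic)
    "last": "Rader",                    # string (atomic)
    "classes": ["CS-109A", "STAT139"],  # list (compound)
}

# Hypothetical extra field, added only to illustrate a date value.
student["enrolled"] = date(2023, 9, 1)

for key, value in student.items():
    print(key, type(value).__name__, value)
```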
How is the data stored?
How is your data represented and stored (data format)?
• Tabular Data: a dataset that is a two-dimensional table, where each row typically represents a single data record, and each column represents one type of measurement (csv, dat, xlsx, etc.).
• Structured Data: each data record is presented in the form of a [possibly complex and multi-tiered] dictionary (json, xml, etc.)
• Semistructured Data: not all records are represented by the same set of keys, or some data records are not represented using the key-value pair structure.
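A short sketch of the difference between these formats, using only the standard library. The field names seq_id, status and duration follow the ride table discussed on the next slide; the values and the "tags" key are made up for illustration.

```python
import csv
import io
import json

# Tabular data: every record has the same columns (hypothetical values).
csv_text = "seq_id,status,duration\n1,Closed,316\n2,Closed,64\n"
rows = list(csv.DictReader(io.StringIO(csv_text)))
# Note: values read from CSV arrive as strings and must be converted explicitly.
durations = [int(r["duration"]) for r in rows]

# Structured / semistructured data: each record is a dictionary,
# and records do not have to share the same set of keys.
json_text = """[
  {"seq_id": 1, "status": "Closed", "duration": 316},
  {"seq_id": 2, "status": "Closed", "tags": ["weekend"]}
]"""
records = json.loads(json_text)
keys_per_record = [sorted(r.keys()) for r in records]

print(durations)
print(keys_per_record)
```

The second JSON record has no "duration" key, which is the kind of irregularity that makes data semistructured.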
How is the data stored?
In tabular data, we expect each record or observation to represent a set of measurements of a single object or event.
How is the data stored?

Each type of measurement is called a variable or an attribute of the data (e.g. seq_id, status and duration are variables or attributes). The number of attributes is called the dimension. These are often called features.
We expect each table to contain a set of records or observations of the same kind of object or event (e.g. our table above contains observations of rides/checkouts).
How is the data stored?
We’ll see later that it’s important to distinguish between classes of variables or attributes based on the type of values they can take on.
• Quantitative variable: is numerical and can be either:
  ◦ discrete - a finite number of values are possible in any bounded interval. For example: “Number of siblings” is a discrete variable
  ◦ continuous - an infinite number of values are possible in any bounded interval. For example: “Height” is a continuous variable
• Categorical variable: takes values from a set of categories, typically with no inherent order among the values. For example: “What kind of pet you have” is a categorical variable
Basic feature types
• Binary/Dichotomous variables
• Categorical (qualitative)
  ◦ Binary variables
  ◦ Nominal variables
  ◦ Ordinal variables

• Numeric variables (quantitative)
  ◦ Interval-scale variables
  ◦ Ratio-scale variables
Binary/Dichotomous variables
• The attribute can take two values, {0,1} or {true,false}
  ◦ usually, 0 means absence, 1 means presence
  ◦ e.g., smoker variable: 1 → smoker, 0 → non-smoker
• Are both values equally important?
  ◦ Symmetric binary: both outcomes are equally important
    • e.g., gender (male, female)
  ◦ Asymmetric binary: outcomes are not equally important
    • e.g., medical tests (positive vs. negative)
    • Convention: assign 1 to the most important outcome (e.g., HIV positive)
What are the binary variables in the example below?
Categorical: Nominal variables
• The attribute can take values within a set of M categories/states (binary variables are a special case)
  ◦ No ordering in the categories/states.
  ◦ Only distinctness relationships apply, i.e., equal (=) and different (≠)

  ◦ Examples:
    • Colors = {brown, green, blue, …, gray}
    • Occupation = {engineer, doctor, teacher, …, driver}
What are the categorical variables in the example below?
Categorical: Ordinal variables
• Similar to nominal variables, but the M states are ordered/ranked in a meaningful way.
  ◦ Order relationships can be applied, i.e., >, ≥, <, ≤
  ◦ However, the difference and the ratio between these values have no meaning.
    • E.g., is the gap 5* − 3* the same as the gap 3* − 1*? Is 4* two times better than 2*? Neither question has a meaningful answer.

  ◦ Examples:
    • School grades: {A, B, C, D, F}, {AA, BA, BB, CB, CC, DC, DD, FD, FF}, {A+, A, B+, B, C+, C, D, F}
    • Movie ratings: {hate, dislike, indifferent, like, love}, {*, **, ***, ****, *****}, {1, 2, 3, 4, 5}
    • Medals = {bronze, silver, gold}

What are the categorical variables in the example below?
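One way to work with an ordinal variable in code is to map each label to its rank. The sketch below uses the movie-rating scale from the examples; the sample ratings and the helper name is_better are assumptions made for illustration.

```python
# Ordered states of the ordinal variable, from lowest to highest.
levels = ["hate", "dislike", "indifferent", "like", "love"]
rank = {label: i for i, label in enumerate(levels)}

# Hypothetical sample of ratings.
ratings = ["like", "love", "dislike", "like", "indifferent"]

def is_better(a, b):
    """Order comparison (>) is meaningful for ordinal values."""
    return rank[a] > rank[b]

# The median only needs the order, so it is a valid summary for ordinal data.
ranks_sorted = sorted(rank[r] for r in ratings)
median_label = levels[ranks_sorted[len(ranks_sorted) // 2]]

print(is_better("love", "like"))
print(median_label)
```

The integer ranks are only a device for comparing and sorting; subtracting or dividing them would suggest differences and ratios that the scale does not support.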
Numeric: Interval-scale variables
• Differences between values are meaningful
  ◦ The difference between temperatures of 90° and 100° is the same as the difference between temperatures of 40° and 50°.

• Examples:
  ◦ Calendar dates, temperature in Fahrenheit or Celsius, ...

• Ratios still have no meaning
  ◦ A temperature of 2 °C is not twice as warm as a temperature of 1 °C.
  ◦ The issue is that the 0° point of the Celsius scale is, in a physical sense, arbitrary, and therefore the ratio of two Celsius temperatures is not physically meaningful.
Numeric: Ratio-scale variables
• Both differences and ratios have a meaning
  ◦ E.g., a 100 kg person is twice as heavy as a 50 kg person.
  ◦ E.g., a 50-year-old person is twice as old as a 25-year-old person.

• Meaningful (unique and non-arbitrary) zero value

• Examples:
  ◦ age, weight, length, number of sales
  ◦ temperature in Kelvin
    • When measured on the Kelvin scale, a temperature of 2 K is, in a physically meaningful way, twice that of 1 K.
  ◦ The zero value is absolute zero, which represents the complete absence of molecular motion

What are the ratio-scale variables in the example below?
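The difference between interval and ratio scales can be checked numerically. This is a small sketch; the two temperatures are arbitrary example values.

```python
def celsius_to_kelvin(c):
    """Convert a Celsius temperature to Kelvin."""
    return c + 273.15

a_c, b_c = 10.0, 20.0  # arbitrary example temperatures in Celsius
a_k, b_k = celsius_to_kelvin(a_c), celsius_to_kelvin(b_c)

# Differences agree on both scales (interval property).
diff_c = b_c - a_c
diff_k = b_k - a_k

# Ratios do not agree: only the Kelvin ratio is physically meaningful.
ratio_c = b_c / a_c  # 2.0, an artifact of where 0 °C happens to sit
ratio_k = b_k / a_k  # about 1.035

print(diff_c, round(diff_k, 6))
print(ratio_c, round(ratio_k, 3))
```

20 °C looks like "twice" 10 °C only because of the arbitrary zero point; on the Kelvin scale the same two temperatures differ by about 3.5%.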


Nominal, ordinal, interval-scale, ratio-scale variables: overview of operations

Source: https://www.sagepub.com/sites/default/files/upm-binaries/19708_6.pdf
Common Issues
A few good generic questions to ask are as follows:
● How big is the dataset?
● Is this the entire dataset?
● Is this data representative enough? For example, maybe data was only collected for a subset of users.
● Are there likely to be gross outliers or extraordinary sources of noise? For example, 99% of the traffic from a web server might be a single denial-of-service attack.
● Might there be artificial data inserted into the dataset? This happens a lot in industrial settings.
● Are there any fields that are unique identifiers? These are the fields you might use for joining between datasets, etc.
● Are the supposedly unique identifiers actually unique? What does it mean if they aren’t?
● If there are two datasets A and B that need to be joined, what does it mean if something in A doesn’t match anything in B?
● When data entries are blank, where does that come from?
● How common are blank entries?
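Several of these questions can be answered with a few lines of code. The sketch below uses two tiny made-up datasets (users and orders); all field names and values are hypothetical.

```python
from collections import Counter

# Hypothetical datasets A (users) and B (orders) to be joined on user_id.
users = [
    {"user_id": "u1", "email": "a@example.com"},
    {"user_id": "u2", "email": ""},
    {"user_id": "u2", "email": "c@example.com"},
    {"user_id": "u3", "email": None},
]
orders = [
    {"order_id": 1, "user_id": "u1"},
    {"order_id": 2, "user_id": "u9"},
]

# Are the supposedly unique identifiers actually unique?
id_counts = Counter(u["user_id"] for u in users)
duplicate_ids = [uid for uid, count in id_counts.items() if count > 1]

# How common are blank entries? (empty string and None both count as blank here)
blank_count = sum(1 for u in users if not u["email"])
blank_fraction = blank_count / len(users)

# Which records in B do not match anything in A?
known_ids = set(id_counts)
unmatched_orders = [o for o in orders if o["user_id"] not in known_ids]

print(duplicate_ids, blank_fraction, unmatched_orders)
```

Each result is a prompt for a follow-up question rather than an answer: a duplicated identifier or an unmatched order still has to be explained by whoever produced the data.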
Common Issues

Common issues with data:

• Missing values: how do we fill in?
• Wrong values: how can we detect and correct?
• Messy format
• Not usable: the data cannot answer the question posed
Common Issues
• The following is a table accounting for the number of produce deliveries over a weekend.
• What are the variables in this dataset? What object or event are we measuring?

What’s the issue? How do we fix it?


Common Issues
We’re measuring individual deliveries; the variables are Time, Day, and Number of Produce.

Problem: each column header represents a single value rather than a variable. Row headers are “hiding” the Day variable. The values of the variable “Number of Produce” are not recorded in a single column.
Common Issues – Now it is better!
We need to reorganize the information to make explicit the event we’re observing and the variables associated with this event.
Common Issues – Now it is better!
Common causes of messy data are:
• Column headers are values, not variable names
• Variables are stored in both rows and columns
• Multiple variables are stored in one column/entry
• Multiple types of experimental units are stored in the same table
In general, we want each file to correspond to a dataset, each column to represent a single variable and each row to represent a single observation.
We want to tabularize the data. This makes Python happy.
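The reorganization described above can be sketched in plain Python. The numbers and the time-of-day labels below are invented for illustration; only the variable names (Day, Time, Number of Produce) come from the slides.

```python
# "Wide" layout: row headers hide the Day variable and column headers
# are values of the Time variable (hypothetical numbers).
wide = {
    "Friday":   {"Morning": 25, "Afternoon": 10},
    "Saturday": {"Morning": 30, "Afternoon": 12},
    "Sunday":   {"Morning": 8,  "Afternoon": 0},
}

# "Long" (tabular) layout: one row per observation, one column per variable.
long_rows = [
    {"day": day, "time": time, "number_of_produce": count}
    for day, by_time in wide.items()
    for time, count in by_time.items()
]

for row in long_rows:
    print(row)
```

After the reshape, every row is one delivery observation and "number_of_produce" lives in a single column, so filtering or summing by day or by time becomes a simple loop over rows.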
Where to find Datasets?
https://www.kaggle.com/c/titanic/data
Where to find Datasets?
UCI Machine Learning Repository - 559 data sets https://archive.ics.uci.edu/ml/index.php
Where else to find Datasets?
Academic Torrents
r/datasets
datahub.io
aws.amazon.com/datasets
Kaggle Data Sources
Kaggle Datasets
World Bank Data
NYC Taxi data
Open Data Philly: connecting people with data for Philadelphia
National Climatic Data Center - NOAA
ClimateData.us
UNICEF Data
UNdata
NASA Socioeconomic Data and Applications Center - SEDAC
San Francisco Government Open Data
The Internet Movie Database (IMDb)
Crowdsourced and curated data about all aspects of the motion picture industry,
at www.imdb.com
• data on over 3.3 million movies and TV programs
• For each film, IMDb includes:
• its title, running time, genres, date of release, and a full list of cast and crew.
• financial data about each production, including the budget for making the film
and how well it did at the box office.
• ratings for each film from viewers and critics (scores on a scale of zero to ten stars)
• written reviews
• links between films: for example, identifying which other films have been
watched most often by viewers of a film

Ref: The Data Science Design Manual by Steven Skiena


The Internet Movie Database (IMDb)
What kind of questions can you answer with this data?
Problem Solving
• Questions: Identify
• Data: Collecting, Curating, Cleaning
• Visualize, Analyze, Model (Machine Learning)
Questions to ask along the way:
• Does the past represent the future?
• What do I want to model?
• How will the model be used?
• What data do I need?
• What data do I have?
• How hard is it to get data?
The Internet Movie Database (IMDb)
What kind of questions can you answer with this data?

• Which actors appeared in the most films? Appeared in the lowest-rated films? Had the longest career or the shortest lifespan?

• What was the highest-rated film each year, or the best in each genre? Which movies lost the most money, had the highest-powered casts, or got the least favorable reviews?

Ref: The Data Science Design Manual by Steven Skiena


The Internet Movie Database (IMDb)
What kind of questions can you answer with this data?

• How well does movie gross correlate with viewer ratings or awards?

• How do Hollywood movies compare to Korean movies, in terms of ratings, budget, and gross? Are foreign movies better received than American films, and how does this differ between U.S. and non-U.S. reviewers?

Ref: The Data Science Design Manual by Steven Skiena


The Internet Movie Database (IMDb)
What kind of questions can you answer with this data?

• What is the age distribution of actors and actresses in films? How much younger is the actress playing the wife, on average, than the actor playing the husband? Has this disparity been increasing or decreasing with time?

• In which movies did Tom Hanks play alongside Keanu Reeves?

Ref: The Data Science Design Manual by Steven Skiena
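Questions of this kind map onto short programs once the data is in a tabular or dictionary form. The sketch below uses three invented film records with placeholder titles and actor names; it does not reflect IMDb's actual data format or API.

```python
from collections import Counter

# Hypothetical records; titles, ratings and cast names are placeholders.
films = [
    {"title": "Film A", "year": 2001, "rating": 7.1, "cast": ["Actor X", "Actor Y"]},
    {"title": "Film B", "year": 2001, "rating": 8.3, "cast": ["Actor X", "Actor Z"]},
    {"title": "Film C", "year": 2002, "rating": 6.0, "cast": ["Actor Y", "Actor Z", "Actor X"]},
]

# Which actor appeared in the most films?
appearances = Counter(actor for f in films for actor in f["cast"])
most_active = appearances.most_common(1)[0][0]

# What was the highest-rated film each year?
best_by_year = {}
for f in films:
    best = best_by_year.get(f["year"])
    if best is None or f["rating"] > best["rating"]:
        best_by_year[f["year"]] = f

# In which films did two given actors appear together?
pair = {"Actor X", "Actor Y"}
together = [f["title"] for f in films if pair <= set(f["cast"])]

print(most_active, best_by_year[2001]["title"], together)
```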


Why Python?
• Python is the most popular language for data scientists. R is its only major competitor, at least when it comes to free tools.
  • Graphical packages such as Excel or Tableau are available, but they are not as flexible as code!
• High-level scripting language
• A jack-of-all-trades
  • If you only need to worry about statistics, or numerical computation, or web parsing, then there are better options out there. But if you need to do all of these things within a single project, then Python is your best option. Since data science is so inherently multidisciplinary, this makes it a perfect fit.
• Open source frameworks and libraries!
  • Scientific Computing with Python (NumPy, SciPy, scikit-learn)
Why Python?
Other Programming Languages for Data Science:

• R
  • designed by and for statisticians, and natively integrated with graphics capabilities and extensive statistical functions
• Matlab and Octave
  • physics or mechanical/electrical engineering
  • Matlab is not open source (Octave is)
• SAS (Statistical Analysis System)
  • business statistics applications
  • Sociology
• Scala
  • up-and-coming language that shows a lot of promise
  • doesn’t yet have comparable library support for analytics and visualizations
Why Python?
Some tools besides programming languages that might be handy in general:

• Excel: Microsoft products often get a bad rap in the data science world, and it is completely undeserved. For simple data analysis, Excel is probably the best piece of software ever made.
• Tableau: This is a tool for visualizing the data in relational databases. It’s pretty limited in its functionality, but when it works, the graphics are beautiful.
• Weka: This is a tool for applying pre-canned machine learning algorithms to datasets that are already well formatted and contain the relevant features. An advantage of Weka is that it’s really just a thin GUI wrapper around some Java libraries, so it’s easy to use the same models in your exploratory analysis and later production code (assuming that you work in Java).
Why Python for data science?
Guido van Rossum – the Zen of Python:

Whitespace instead of symbols
• tabs, indentation and line breaks matter
• code remains uncluttered

Variable types determined automatically
• no need to declare the type of your variables before assigning values

Intuitive grammar
• PEP 8: style guide

Python’s Benevolent Dictator for Life
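A small sketch of the two language features listed above, automatically determined types and indentation as block structure. The function name describe is made up for the example.

```python
# Types are determined automatically; the same name can be rebound
# to values of different types without any declaration.
x = 42
type_before = type(x).__name__  # 'int'
x = "forty-two"
type_after = type(x).__name__   # 'str'
x = 4.2

def describe(value):
    # Indentation, not braces, marks the body of the function and of the if.
    if isinstance(value, str):
        return "text"
    return "number"

print(type_before, type_after, describe(x))
```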


Three advantages:
1. Python is popular
• Large user community
• Well-maintained libraries
• Online guidance (StackOverflow)
2. Easy to learn and share

WHY PEOPLE LIKE IT:
• Code is intuitive and expressive (compare C++)
• Suited to large quantities of data
• Transparent, reproducible research through Jupyter Notebooks
3. Thriving ecosystem of tools

Data science workflow, with example libraries for each stage:
• Get data: BeautifulSoup, MySQL client, API clients (Twitter, ESRI, OSMnx, …)
• Clean data: pandas, GeoPandas, Rasterio
• Modeling and analysis: NumPy, SciPy, statsmodels, scikit-learn
• Evaluate and present: Jupyter Notebook, Matplotlib, Flask
Python Development Environments and Code Editors
• Python IDEs (Integrated Development Environments): specialized software applications designed for Python development.
  • PyCharm, Visual Studio Code with Python extensions, Spyder
• Jupyter Notebook
• Text Editors, Command Line/Terminal: text editors like Notepad (on Windows), TextEdit (on macOS), or any code-focused text editor (e.g., Sublime Text, Atom, VSCode without specific Python extensions).
• Online Python Editors/IDEs: online platforms that run directly in the web browser.
  • Repl.it, PythonAnywhere, Google Colab (for Jupyter notebooks), Codecademy's Python coding environment, Kaggle kernels
• Python Shell
A Quick Python Tour
>>> print("Hello world!")
Hello world!
>>> number = 42
>>> print(f"The number is: {number}")
The number is: 42
>>> print("The number is: {}".format(number))
The number is: 42
>>> name = "Alice"
>>> age = 30
>>> print("Name:", name)
Name: Alice
>>> print("Age:", age)
Age: 30
>>> pi = 3.14159265
>>> print(f"The value of pi is: {pi:.2f}")  # Displays pi with 2 decimal places
The value of pi is: 3.14
>>> x = 42
>>> print(f"Value: {x:05}")  # Pads with leading zeros to a total width of 5
Value: 00042
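A Quick Python Tour
Basic Math:
print(3+5)   # addition: 8
print(3-5)   # subtraction: -2
print(3*5)   # multiplication: 15
print(3/5)   # true division: 0.6
print(3//5)  # floor division: 0
print(3**5)  # exponentiation: 243
print(3%5)   # remainder (modulo): 3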
A Quick Python Tour
Control Flow:

hour = 16
if hour < 12:
    print('Good morning!')
elif hour >= 12 and hour < 20:
    print('Good afternoon!')
else:
    print('Good evening!')
A Quick Python Tour
Loops:

i = 2
while i < 20:              # prints 2, 3, ..., 19
    print(i)
    i += 1
for i in range(2, 10, 2):  # prints 2, 4, 6, 8
    print(i)
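A Quick Python Tour
Data Structures:
A Quick Python Tour
Lists:
countries = ['Portugal','Spain','United Kingdom']
numbers = list(range(10))   # [0, 1, ..., 9]
len(countries)              # 3
countries[0]                # 'Portugal'
countries[1]                # 'Spain'
countries[2]                # 'United Kingdom'
numbers[-1]                 # 9 (last element)
numbers[-2]                 # 8 (second to last)
numbers[3:5]                # [3, 4]
numbers[-2:]                # [8, 9]
numbers[:-2]                # [0, 1, 2, 3, 4, 5, 6, 7]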
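A Quick Python Tour
Lists:
things = "Apples Oranges Crows Telephone Light Sugar"
stuff = things.split(' ')   # split the string into a list of six words

more_stuff = ["Day", "Night", "Song", "Frisbee", "Corn",
              "Banana", "Girl", "Boy"]

# move items from the end of more_stuff into stuff until it has ten items
while len(stuff) != 10:
    next_one = more_stuff.pop()
    stuff.append(next_one)
    print("There are %d items now." % len(stuff))

print(stuff[1])    # Oranges
print(stuff[-1])   # Corn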
A Quick Python Tour
Dictionaries:
thisdict = {
    "brand": "Ford",
    "model": "Mustang",
    "year": 1964
}
print(thisdict)

{'brand': 'Ford', 'model': 'Mustang', 'year': 1964}

A Quick Python Tour
Dictionaries:
thisdict = {
    "brand": "Ford",
    "model": "Mustang",
    "year": 1964
}

x = thisdict["model"]   # look up a value by its key
print(x)

Mustang
A Quick Python Tour
Dictionaries:
thisdict = {
    "brand": "Ford",
    "model": "Mustang",
    "year": 1964
}

x = thisdict.get("model")   # like thisdict["model"], but returns None if the key is missing
print(x)

Mustang
A Quick Python Tour
Dictionaries:
thisdict = {
    "brand": "Ford",
    "model": "Mustang",
    "year": 1964
}

thisdict["year"] = 2018   # overwrite the value stored under an existing key

print(thisdict)

{'brand': 'Ford', 'model': 'Mustang', 'year': 2018}
A Quick Python Tour
Functions:

def greet(hour):
    if hour < 12:
        print('Good morning!')
    elif hour >= 12 and hour < 20:
        print('Good afternoon!')
    else:
        print('Good evening!')
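A Quick Python Tour
Importing libraries:

import math                  # whole module; use as math.sqrt(x)
from math import sqrt        # a single name; use as sqrt(x)
from math import sqrt, pow   # several names at once
from math import *           # every public name (clutters the namespace)
import math as matematik     # module under an alias
import numpy as np           # conventional alias for NumPy (third-party package)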
TODO:

• Learn hands-on Python
• No need for installations, just start! (Shell runs in the browser)

• https://www.codecademy.com/learn/learn-python *****

• https://www.w3schools.com/python/default.asp
• https://www.python.org/about/gettingstarted/
• https://www.learnpython.org/

• Python scientific computing libraries and Jupyter Notebook Tutorial (document)
Jupyter Notebooks
• https://drive.google.com/file/d/1fXNZDSTvU1ZIOj3sxr86zTc9qVyTJRR3/view?usp=share_link
• https://drive.google.com/file/d/19SOtD0YFhAG0nv3WIvNqy5TIxHIS6HTa/view?usp=share_link
• https://drive.google.com/file/d/1YjNcs0IbAH3O7bIknAbyaEdYOJEeieiv/view?usp=share_link
• https://drive.google.com/file/d/1-VJn5y-Mv5uYKvj3lYyzh2E4R08WWx5b/view?usp=share_link
References:
• Pavlos Protopapas, Kevin Rader and Chris Tanner (data)
• Hendrik Heuer (python tour)
