CENG313 Introduction To Data Science: Lecture 3-4: Data Types and Datasets
• Instances are described through features (also known as attributes, variables, or dimensions)
  ◦ E.g., a course is described in terms of a title, description, lecturer, teaching frequency, etc.
• The feedback feature (for supervised learning) is called the class attribute
Data
• Data can often be represented or abstracted as an n × d data matrix D
  ◦ n rows correspond to instances
  ◦ d columns correspond to features, which form the feature set F
• The number of instances n is referred to as the size or cardinality of the dataset, n = |D|
• The number of features d is referred to as the dimensionality of the dataset
• Subset of the data: D′ ⊆ D
• Subspace: F′ ⊆ F
  ◦ Subspace projection: restricting D to the features in F′
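The definitions above can be tried out on a tiny data matrix. This is a minimal sketch using plain Python lists; the feature names and values are made up for illustration and do not come from the lecture.

```python
# Toy data matrix D with n = 4 instances (rows) and d = 3 features (columns).
# The feature names and values are invented for illustration.
features = ["height", "width", "weight"]   # feature set F
D = [
    [1.0, 2.0, 3.0],
    [4.0, 5.0, 6.0],
    [7.0, 8.0, 9.0],
    [10.0, 11.0, 12.0],
]

n = len(D)       # size (cardinality) of the dataset
d = len(D[0])    # dimensionality of the dataset

# Subset of the data: keep only some instances (rows).
D_subset = [row for row in D if row[0] > 3.0]

# Subspace and projection: keep only the columns of the chosen features.
F_sub = ["height", "weight"]
cols = [features.index(name) for name in F_sub]
D_projected = [[row[j] for j in cols] for row in D]

print(n, d)
print(D_subset)
print(D_projected)
```

A subset reduces the number of rows, while a subspace projection reduces the number of columns; both leave the remaining values unchanged.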
An example from the iris dataset
What type of data?
What kind of values are in your data (data types)?
Simple or atomic:
• Numeric: integers, floats
• Boolean: binary or true/false values
• Strings: sequences of symbols
What type of data?
What kind of values are in your data (data types)?
Compound, composed of several atomic types:
• Date and time: a compound value with a specific structure
• Lists: a list is a sequence of values
• Dictionaries: a dictionary is a collection of key-value pairs x : y, where x (the key) is usually a string representing the “name” of the entry, and y is a value of any type.
Example: Student record
• First: Kevin
• Last: Rader
• Classes: [CS-109A, STAT139]
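As a rough illustration, the student record can be written as a Python dictionary whose values mix atomic and compound types. The `enrolled` date below is a hypothetical extra field, added only to show a date-time value; it is not part of the slide.

```python
from datetime import datetime

# The student record from the slide, written as a dictionary.
student = {
    "First": "Kevin",                    # string (atomic)
    "Last": "Rader",                     # string (atomic)
    "Classes": ["CS-109A", "STAT139"],   # list nested inside the dictionary
}

# Hypothetical field (not on the slide) showing a compound date-time value.
enrolled = datetime(2020, 9, 1, 9, 30)

print(student["First"], student["Last"])
print(len(student["Classes"]))
print(enrolled.year, enrolled.month)
```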
How is the data stored?
How is your data represented and stored (data format)?
• Tabular data: a dataset that is a two-dimensional table, where each row typically represents a single data record, and each column represents one type of measurement (csv, dat, xlsx, etc.).
• Structured data: each data record is presented in the form of a [possibly complex and multi-tiered] dictionary (json, xml, etc.).
• Semistructured data: not all records are represented by the same set of keys, or some data records are not represented using the key-value pair structure.
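The three formats can be sketched with the standard library alone. All records below are invented examples, and the in-memory strings stand in for files on disk.

```python
import csv
import io
import json

# Tabular data: every row has the same columns (CSV text, invented here).
csv_text = "id,name,score\n1,Ada,90\n2,Bob,85\n"
rows = list(csv.DictReader(io.StringIO(csv_text)))

# Structured data: each record is a (possibly nested) dictionary (JSON).
json_text = '{"id": 1, "name": "Ada", "classes": ["math", "physics"]}'
record = json.loads(json_text)

# Semistructured data: records do not all share the same set of keys.
records = [
    {"id": 1, "name": "Ada", "email": "ada@example.com"},
    {"id": 2, "name": "Bob"},   # this record has no "email" key
]
all_keys = set()
for r in records:
    all_keys.update(r)
missing = {r["id"]: sorted(all_keys - set(r)) for r in records}

print(rows[0]["name"], record["classes"], missing)
```

Note that `csv.DictReader` returns every value as a string, so numeric columns such as `score` still need an explicit conversion.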
How is the data stored?
In tabular data, we expect each record or observation to represent a set of measurements of a single object or event.
How is the data stored?
• Examples of nominal (categorical) variables:
  ◦ Colors = {brown, green, blue, …, gray}
  ◦ Occupation = {engineer, doctor, teacher, …, driver}
Categorical: Ordinal variables
• Similar to nominal variables, but the M states are ordered/ranked in a meaningful way.
  ◦ This allows order relationships to be applied, i.e., >, ≥, <, ≤
  ◦ However, the difference and ratio between these values have no meaning.
    ▪ E.g., is 5* − 3* the same as 3* − 1*? Is 4* two times better than 2*?
  ◦ Examples:
    ▪ School grades: {A, B, C, D, F}, {AA, BA, BB, CB, CC, DC, DD, FD, FF}, {A+, A, B+, B, C+, C, D, F}
    ▪ Movie ratings: {hate, dislike, indifferent, like, love}, {*, **, ***, ****, *****}, {1, 2, 3, 4, 5}
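One possible way to work with ordinal values in code is to map each label to its position in the ordering and use that position only for comparisons. The helper name `is_better` is hypothetical.

```python
# Ordered movie ratings from the slide, worst to best.
levels = ["hate", "dislike", "indifferent", "like", "love"]
rank = {label: position for position, label in enumerate(levels)}

def is_better(a, b):
    """Return True if rating a is ranked strictly higher than rating b."""
    return rank[a] > rank[b]

ratings = ["like", "hate", "love", "indifferent"]
print(is_better("love", "like"))
print(sorted(ratings, key=rank.get))
print(max(ratings, key=rank.get))
```

The positions support sorting, minimum, maximum, and median, but subtracting or dividing them would assume equal spacing between levels, which ordinal data does not guarantee.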
• Examples of interval-scaled variables:
  ◦ Calendar dates, temperature in Fahrenheit or Celsius, ...
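Temperatures in Celsius and Fahrenheit are interval-scaled: differences are meaningful, but ratios are not, because the zero point is arbitrary. A small sketch using the standard conversion formula (the two sample temperatures are arbitrary):

```python
def c_to_f(celsius):
    """Convert a temperature from Celsius to Fahrenheit."""
    return celsius * 9 / 5 + 32

low_c, high_c = 10.0, 20.0
low_f, high_f = c_to_f(low_c), c_to_f(high_c)

# Differences are meaningful: a 10 degree gap in Celsius is always an
# 18 degree gap in Fahrenheit, wherever it sits on the scale.
print(high_c - low_c, high_f - low_f)

# Ratios are not: "twice as hot" in Celsius is not twice as hot in
# Fahrenheit, because each scale places zero at a different point.
print(high_c / low_c, high_f / low_f)
```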
Source: https://www.sagepub.com/sites/default/files/upm-binaries/19708_6.pdf
Common Issues
A few good generic questions to ask are as follows:
● How big is the dataset?
● Is this the entire dataset?
● Is this data representative enough? For example, maybe data was only collected for a subset of users.
● Are there likely to be gross outliers or extraordinary sources of noise? For example, 99% of the traffic from a web server might be a single denial-of-service attack.
● Might there be artificial data inserted into the dataset? This happens a lot in industrial settings.
● Are there any fields that are unique identifiers? These are the fields you might use for joining between datasets, etc.
● Are the supposedly unique identifiers actually unique? What does it mean if they aren’t?
● If there are two datasets A and B that need to be joined, what does it mean if something in A doesn’t match anything in B?
● When data entries are blank, where does that come from?
● How common are blank entries?
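Several of these questions can be checked with a few lines of code. The sketch below uses two small invented tables and assumes a field named `user_id` is the supposed unique identifier; real datasets would normally be loaded from files.

```python
from collections import Counter

# Two small invented tables; "user_id" is the supposed unique identifier.
A = [
    {"user_id": 1, "country": "PT"},
    {"user_id": 2, "country": ""},      # blank entry
    {"user_id": 2, "country": "ES"},    # duplicated identifier
    {"user_id": 4, "country": "UK"},
]
B = [{"user_id": 1, "plan": "free"}, {"user_id": 2, "plan": "pro"}]

# How big is the dataset?
size = len(A)

# Are the supposedly unique identifiers actually unique?
id_counts = Counter(row["user_id"] for row in A)
duplicates = [uid for uid, count in id_counts.items() if count > 1]

# How common are blank entries?
blank_share = sum(row["country"] == "" for row in A) / size

# Which identifiers in A match nothing in B?
unmatched = sorted(set(id_counts) - {row["user_id"] for row in B})

print(size, duplicates, blank_share, unmatched)
```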
Ref: www.imdb.com
The Internet Movie Database (IMDb)
What kind of questions can you answer with this data?
Problem Solving
[Diagram labels: Questions, Identify, Data, Collecting, Curating, Cleaning, Visualize, Analyze, Model (Machine Learning)]
• Does the past represent the future?
• What do I want to model?
• How will the model be used?
• What data do I need?
• What data do I have?
• How hard is it to get data?
The Internet Movie Database (IMDb)
What kind of questions can you answer with this data?
• Which actors appeared in the most films? Appeared in the lowest rated films? Had the longest career or the shortest lifespan?
• What was the highest rated film each year, or the best in each genre? Which movies lost the most money, had the highest-powered casts, or got the least favorable reviews?
• What is the age distribution of actors and actresses in films? How much younger is the actress playing the wife, on average, than the actor playing the husband? Has this disparity been increasing or decreasing with time?
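Two of these questions can be sketched on a toy table. The records below are invented placeholders, not IMDb data, and the tuple layout (film, year, rating, actor) is an assumption made only for this example.

```python
from collections import Counter

# Invented cast records standing in for IMDb data: (film, year, rating, actor).
credits = [
    ("Film A", 2001, 7.5, "Actor X"),
    ("Film A", 2001, 7.5, "Actor Y"),
    ("Film B", 2001, 8.1, "Actor X"),
    ("Film C", 2002, 6.0, "Actor X"),
    ("Film C", 2002, 6.0, "Actor Z"),
]

# Which actors appeared in the most films?
films_per_actor = Counter(actor for _, _, _, actor in credits)
busiest_actor, film_count = films_per_actor.most_common(1)[0]

# What was the highest rated film each year?
best_by_year = {}
for film, year, rating, _ in credits:
    if year not in best_by_year or rating > best_by_year[year][1]:
        best_by_year[year] = (film, rating)

print(busiest_actor, film_count)
print(best_by_year)
```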
• R
  ◦ designed by and for statisticians; natively integrated with graphics capabilities and extensive statistical functions
• Matlab and Octave
  ◦ used in physics and mechanical/electrical engineering
  ◦ Matlab is not open source (Octave is)
• SAS (Statistical Analysis System)
  ◦ business statistics applications
  ◦ sociology
• Scala
  ◦ up-and-coming language that shows a lot of promise
  ◦ does not yet have comparable library support for analytics and visualization
Why Python?
Some tools besides programming languages that might be handy in general:
• Excel: Microsoft products often get a bad rap in the data science world, and it is completely undeserved. For simple data analysis, Excel is probably the best piece of software ever made.
• Tableau: This is a tool for visualizing the data in relational databases. It’s pretty limited in its functionality, but when it works, the graphics are beautiful.
• Weka: This is a tool for applying pre-canned machine learning algorithms to datasets that are already well formatted and contain the relevant features. An advantage of Weka is that it’s really just a thin GUI wrapper around some Java libraries, so it’s easy to use the same models in your exploratory analysis and later production code (assuming that you work in Java).
Why Python for data science?
Guido van Rossum (creator of Python); the Zen of Python (PEP 20, written by Tim Peters)
1. Intuitive grammar
• PEP 8: style guide
• Well-maintained libraries
• Online guidance (StackOverflow)
2. Easy to learn and share
• Transparent, reproducible research through Jupyter Notebooks
3. Thriving ecosystem of tools
Data science workflow and example libraries
• Get data: BeautifulSoup, MySQL client, API clients (Twitter, ESRI, OSMNx…)
• Clean data: Pandas, Geopandas, Rasterio
• Modeling and analysis: NumPy, SciPy, statsmodels, scikit-learn
• Evaluate and present: Jupyter Notebook, Matplotlib, Flask
Python Development Environments and Code Editors
• Python IDEs (Integrated Development Environments): specialized software applications designed for Python development.
  ◦ PyCharm, Visual Studio Code with Python extensions, Spyder
• Jupyter Notebook
• Text editors with the command line/terminal: plain text editors like Notepad (on Windows) or TextEdit (on macOS), or any code-focused text editor (e.g., Sublime Text, Atom, VSCode without specific Python extensions).
• Online Python editors/IDEs: platforms that run directly in the web browser.
  ◦ Repl.it, PythonAnywhere, Google Colab (for Jupyter notebooks), Codecademy's Python coding environment, Kaggle kernels
• Python shell
A Quick Python Tour
>>> print("Hello world!")
Hello world!
>>> name = "Alice"
>>> age = 30
>>> print("Name:", name)
Name: Alice
>>> print("Age:", age)
Age: 30
>>> number = 42
>>> print(f"The number is: {number}")
The number is: 42
>>> print("The number is: {}".format(number))
The number is: 42
>>> pi = 3.14159265
>>> print(f"The value of pi is: {pi:.2f}")  # Displays pi with 2 decimal places
The value of pi is: 3.14
>>> x = 42
>>> print(f"Value: {x:05}")  # Pads with leading zeros to a total width of 5
Value: 00042
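A Quick Python Tour
Basic Math:
print(3+5)    # 8, addition
print(3-5)    # -2, subtraction
print(3*5)    # 15, multiplication
print(3/5)    # 0.6, true division
print(3//5)   # 0, floor division
print(3**5)   # 243, exponentiation
print(3%5)    # 3, remainder (modulo)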
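Floor division and the remainder operator fit together: for integers a and b (b nonzero), `(a // b) * b + a % b` equals `a`. A short sketch with arbitrary example numbers:

```python
a, b = 17, 5

quotient = a // b     # floor division
remainder = a % b     # remainder after floor division

# The two results always recombine into the original number.
recombined = quotient * b + remainder

# divmod returns both values in one call.
pair = divmod(a, b)

# With a negative operand, // rounds toward negative infinity and
# the remainder takes the sign of the divisor.
negative_pair = (-17 // 5, -17 % 5)

print(quotient, remainder, recombined, pair, negative_pair)
```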
A Quick Python Tour
Control Flow:
hour = 16
if hour < 12:
    print('Good morning!')
elif 12 <= hour < 20:
    print('Good afternoon!')
else:
    print('Good evening!')
A Quick Python Tour
Loops:
i = 2
while i < 20:
    print(i)
    i += 1
for i in range(2, 10, 2):
    print(i)
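To see how `range(start, stop, step)` relates to a `while` loop, the sketch below builds the same sequence both ways. The variable names are illustrative only.

```python
# Collect the values produced by range(2, 10, 2) with a for loop ...
for_values = []
for i in range(2, 10, 2):
    for_values.append(i)

# ... and the same values with an equivalent while loop.
while_values = []
i = 2
while i < 10:          # the stop value 10 is excluded, as in range
    while_values.append(i)
    i += 2             # the step

print(for_values, while_values)
```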
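A Quick Python Tour
Data Structures:
A Quick Python Tour
Lists:
countries = ['Portugal', 'Spain', 'United Kingdom']
numbers = list(range(10))   # [0, 1, ..., 9]
len(countries)    # 3
countries[0]      # 'Portugal'
countries[1]      # 'Spain'
countries[2]      # 'United Kingdom'
numbers[-1]       # 9 (last element)
numbers[-2]       # 8 (second to last)
numbers[3:5]      # [3, 4]
numbers[-2:]      # [8, 9]
numbers[:-2]      # [0, 1, 2, 3, 4, 5, 6, 7]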
A Quick Python Tour
Lists:
things = "Apples Oranges Crows Telephone Light Sugar"
stuff = things.split(' ')   # split the string into a list of words
print(stuff[1])             # Oranges
print(stuff[-1])            # Sugar
A Quick Python Tour
Dictionaries:
thisdict = {
    "brand": "Ford",
    "model": "Mustang",
    "year": 1964
}
print(thisdict)
x = thisdict["model"]
print(x)
Mustang
A Quick Python Tour
Dictionaries:
thisdict = {
    "brand": "Ford",
    "model": "Mustang",
    "year": 1964
}
x = thisdict.get("model")
print(x)
Mustang
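Square brackets and `.get()` return the same value when the key exists; they differ when it does not. A brief sketch, using a key `"color"` that is deliberately absent from the dictionary:

```python
thisdict = {"brand": "Ford", "model": "Mustang", "year": 1964}

# .get() returns None (or a chosen default) when the key is missing ...
color = thisdict.get("color")
color_or_default = thisdict.get("color", "unknown")

# ... while square brackets raise a KeyError.
try:
    thisdict["color"]
    raised = False
except KeyError:
    raised = True

print(color, color_or_default, raised)
```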
A Quick Python Tour
Dictionaries:
thisdict = {
    "brand": "Ford",
    "model": "Mustang",
    "year": 1964
}
thisdict["year"] = 2018
print(thisdict)
{'brand': 'Ford', 'model': 'Mustang', 'year': 2018}
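A Quick Python Tour
Functions:
import math                  # use as math.sqrt(x)
from math import sqrt        # use as sqrt(x)
from math import sqrt, pow   # import several names at once
from math import *           # imports every public name; generally discouraged
import math as matematik     # import under an alias
import numpy as np           # conventional alias for NumPy (must be installed)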
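Imported functions are called like any other function, and new ones are defined with `def`. The function names `hypotenuse` and `greet` below are hypothetical examples, not part of the lecture.

```python
import math

def hypotenuse(a, b):
    """Return the length of the hypotenuse of a right triangle."""
    return math.sqrt(a ** 2 + b ** 2)

def greet(name, greeting="Hello"):
    """Build a greeting; `greeting` has a default value."""
    return f"{greeting}, {name}!"

print(hypotenuse(3, 4))
print(greet("Alice"))
print(greet("Alice", greeting="Good morning"))
```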
TODO:
• Learn Python hands-on
• No need for installations, just start! (The shell runs in the browser)
• https://www.codecademy.com/learn/learn-python
• https://www.w3schools.com/python/default.asp
• https://www.python.org/about/gettingstarted/
• https://www.learnpython.org/