Lecture 01

Uploaded by

Ten Ten
Introduction to Data Science

Dr Marcin Maleszka
Wroclaw University of Science and Technology, Poland
for
International University, Vietnam
Introduction assignment (for class no. 3)
• Using any methods try to find info on me:
• What is my home address / what do I drive (plates number)?
• Alternatively: how would you do this for a person in Vietnam?
• Could the methods used be automated for a large number of people?
• A security specialist investigating a single person would simply ask me or
someone else. In Data Science we need to look for data about thousands or millions of
people. Finding me is just an example; we need a method to find a TYPE of person.
• When doing any assignments remember to:
• Give your name / student ID
• Provide final answer and steps to solution
• Be brief but precise
• This task will outline one of the first problems a Data Scientist
encounters: where to get the data!
What is „Data”?
• Organization by complexity of concepts:
• Data – raw numbers
• Information – interpretation added
• Knowledge – rules added / pattern extracted
• (Wisdom? Trust? Intelligence?)

• For purposes of Data Science any of those may be the input


• Most often, as in other places, it will be „raw” data
• The result will often be knowledge, but sometimes information
What is Data Science?
• The methods to extract useful information and knowledge from data,
but mostly:
• unexpected patterns
• aggregations (visualisations)
• representations (models)
• Data Science operates on the level of Data Lake
• It takes into account all possible sources in all possible situations
• Many „classic” fields are nowadays treated as Data Science
• Statistics
• Data Mining
• Some tools from the areas of Machine Learning and Artificial Intelligence
• Graph analysis (Social Networks)
What is this course?
• We will briefly follow the most common tools of Data Science, mostly
from the point of view of statistics and data mining.
• We will visit Big Data, machine learning, AI, graph methods and others
• Literature:
• Slides! on Blackboard
• Murtaza Haider, Getting Started with Data Science, 1st ed., IBM Press, 2015.
• Jiawei Han, Micheline Kamber, Data Mining: Concepts and Techniques, 3rd ed., 2011.
• Jure Leskovec, Anand Rajaraman, Jeffrey David Ullman, Mining of Massive Datasets, 2nd ed., 2014.
Grading

• 40% Final Exam (live in IU)


• 30% Mid-term Exam (live in IU)
• 30% Assignments & Tests during online class

• Minimum of 80% attendance required.


Data Scientists & Data Analysts

• Data Scientist is a new catch-all term for an old concept; it may cover:
• Statisticians
• (and mathematical positions overall)
• Risk Analysts
• (and Analyst positions overall)
• Business Intelligence specialists
• Data Warehouse specialists

• But there are (small) differences between these roles, and similar positions exist outside the DS name
https://www.datanix.ai/post/iipgh-data-science-webinar-5
http://nirvacana.com/thoughts/2013/07/08/becoming-a-data-scientist/
https://medium.com/hackernoon/navigating-the-data-science-career-landscape-db746a61ac62
Data Science presentation

• A Data Scientist needs to present results in an attractive form. This
includes both the written part and any graphics (tables and graphs).
• Later assignments and some exam questions will require this part!
• Try your hand at giving narratives to a report on simple data (information):
• Weather: it is 30 °C for the whole week, but 20 °C on one day
• How to narrate (not „describe”) it in Vietnam? How to narrate it in a colder country (Poland)?
• In 1999 the Mars Climate Orbiter was lost because one team used SI units and the other
used US customary (imperial) units (newton-seconds vs pound-force seconds)
• When writing a newspaper report, what narrative to build here?
• What narrative to use in an internal report, a white paper, a scientific paper?
Reporting data - selecting good graphs
• Which graph will fit best?
• Line graph presenting values over time, with a trend line
• Comparison of category distribution: pie (wheel) graph, column graph
• Gauge / dial
• Result cards
• Progression table
• Raw data
• Choosing a method of presentation depends on intended message
• Remember to add legend and label axes/categories
Graphs

• Important rule: NOT TOO MUCH DATA ON ONE GRAPH

• Adding labels with values:


• Only if 1-2 data series
• No 3D graphs with data series
• Two graphs: long and short-time perspective
• Simpler graph is more readable
Graphs: trend lines
• Why are trends important?
• A trend is usually shown on a line graph
• If much of the graph is empty, an axis may be rescaled
• May be an area (layer) graph
• Merging a column graph and a line graph
• Only if both use the same values on the X axis
• Very clear legend
• Very good description of the Y axis scales
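One common way to obtain the trend line mentioned above is an ordinary least-squares fit. The sketch below is only an illustration under assumptions: the slides do not say which trend method is meant, the helper name `linear_trend` is hypothetical, and the monthly figures are made up.

```python
def linear_trend(values):
    """Return (slope, intercept) of the least-squares line through
    the points (0, values[0]), (1, values[1]), ..."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two points to fit a trend")
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    # Covariance-like and variance-like sums of the classic formula.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(values))
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical monthly values, invented for this example.
monthly_sales = [10.0, 12.0, 11.0, 15.0, 16.0, 18.0]
slope, intercept = linear_trend(monthly_sales)

# Points of the trend line, ready to be drawn over the raw series.
trend_points = [intercept + slope * x for x in range(len(monthly_sales))]
print(f"trend: {slope:+.2f} per month")
```

The raw series and `trend_points` share the same X values, which is the condition the slide gives for putting two series on one graph.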
Graphs: comparison graphs
• Comparing the same measure to the previous value
• What is the most important message?

• The more periods compared, the harder the analysis
• The key is to find a good type of graph
• May use 3D if it remains readable
Graphs: attribute distribution graph
• Distribution is best visible on a pie (wheel) graph
• Needs information on when the data was gathered
• There are no trends
• Notes:
• Add labels
• No more than 10 categories (the rest of the wheel should be labeled „other”)
• Show two pie graphs for comparisons
• Check if it is the most readable approach
• Does not need a Data Warehouse (only one dimension shown)
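The "no more than 10 categories" note can be applied before drawing anything. The following is a minimal sketch, assuming the limit of 10 includes the „other” slice (the slide can be read either way); the function name `top_categories` and the survey counts are hypothetical.

```python
from collections import Counter


def top_categories(counts, max_slices=10, other_label="other"):
    """Keep the largest categories and merge the rest into one slice,
    so that the result has at most max_slices entries."""
    ranked = Counter(counts).most_common()
    if len(ranked) <= max_slices:
        return dict(ranked)
    kept = ranked[: max_slices - 1]
    rest = sum(value for _, value in ranked[max_slices - 1:])
    result = dict(kept)
    # Add to an existing "other" category instead of overwriting it.
    result[other_label] = result.get(other_label, 0) + rest
    return result


# Hypothetical answer counts for 12 categories.
survey = {f"c{i}": i for i in range(1, 13)}
slices = top_categories(survey)

# Shares in percent, i.e. the labels to put on the graph.
total = sum(slices.values())
shares = {name: round(100 * value / total, 1) for name, value in slices.items()}
print(shares)
```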
Filtering in dashboards
• Filtering
• Reduce the range of presented KPI
• Convenient for interactive reports
• Prepared options for changing perspective
• How to compare results for different filters?
• May analyze deeper by adding more attributes

• Pivoted tables and graphs:


• May analyze data for different values of same attribute
• Deeper analysis
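Filtering and pivoting as described above can be sketched in a few lines. This is an illustration only: the `pivot` helper, the column names and the sales rows are assumptions made for the example, not part of the course material.

```python
from collections import defaultdict


def pivot(rows, row_key, col_key, value_key):
    """Sum value_key for every (row_key, col_key) combination."""
    table = defaultdict(lambda: defaultdict(float))
    for row in rows:
        table[row[row_key]][row[col_key]] += row[value_key]
    return {r: dict(cols) for r, cols in table.items()}


# Hypothetical sales records.
sales = [
    {"region": "North", "quarter": "Q1", "amount": 100.0},
    {"region": "North", "quarter": "Q2", "amount": 120.0},
    {"region": "South", "quarter": "Q1", "amount": 80.0},
    {"region": "South", "quarter": "Q1", "amount": 20.0},
]

# Pivot: one row per region, one column per quarter.
by_region = pivot(sales, "region", "quarter", "amount")

# Filter: narrow the range of data first, then pivot the remainder.
q1_only = [row for row in sales if row["quarter"] == "Q1"]
q1_by_region = pivot(q1_only, "region", "quarter", "amount")
print(by_region)
print(q1_by_region)
```

Comparing `by_region` with `q1_by_region` is one simple answer to the question of how to compare results for different filters: compute the same table for each filter and place them side by side.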
Data visualization
• Visualization
• Important part of understanding information
• Helps to search for information and make decisions
• Supports analysis of larger datasets
• Allows detecting additional dependencies
• Reduces effort needed to process information
• Helps to remember data
• Important aspects:
• Color, shape, size, orientation, position, readability
• Visual form, graphical elements, visual clues
• Conditional formatting
• Infographics, schemas, graphs
Data visualization
• Creating a graph:
1. Determine the aim: what to show?
2. Determine what to compare
   1. Percentage of whole
   2. Ranking
   3. Change dynamics
   4. Histogram
   5. Correlations between variables
3. Prepare graphs
4. Format graph
Example: column graphs
The first assignment
(for class no. 5)
• This is how someone presented
data on temperatures
• Is it easy to read and understand?
• Present similar weather data in a clearer format
• Start from raw data that you find in any source.
• Determine what you wish to show (here: how many days were very hot, very
cold in each year in the previous 70 years).
• Create a very clear and easy to follow graph – it may look however you wish,
as long as YOU find it clear.
• We will discuss some of your approaches during class
Organizing the data
• In general Data Science operates on the level of Data Lake – all
information, without filtering, in one place. We take out specific parts
to investigate specific situations.
• This can be the operational database of a company (OLTP) + external
knowledge sources pooled together in one place.
• Alternatively, the information may be organized in some other form:
• Tabular (denormalized database)
• Multidimensional (data warehouse)
• Tree or Forest (documents)
• Graph (social network)
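The four forms listed above can be illustrated with plain Python structures. This is a rough sketch under assumptions: all names and values are invented, and real systems would use a database, a data warehouse, a document store or a graph library instead.

```python
# Tabular: one denormalized row per sale.
tabular = [
    {"customer": "An", "city": "Hanoi", "product": "tea", "amount": 3},
    {"customer": "Binh", "city": "Hue", "product": "tea", "amount": 1},
]

# Multidimensional: a measure addressed by its dimension coordinates.
cube = {}
for row in tabular:
    key = (row["city"], row["product"])
    cube[key] = cube.get(key, 0) + row["amount"]

# Tree: a nested document.
document = {
    "order": {
        "customer": {"name": "An", "city": "Hanoi"},
        "items": [{"product": "tea", "amount": 3}],
    }
}

# Graph: who follows whom, stored as adjacency sets.
follows = {"An": {"Binh"}, "Binh": {"An", "Chi"}, "Chi": set()}
followers_of_an = sorted(
    person for person, targets in follows.items() if "An" in targets
)
print(cube, followers_of_an)
```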
Historical perspective on DB
• 1960s:
• Data collection, databases are created, networked DBMS
• 1970s:
• Relational model, implementation of relational DBMS
• 1980s:
• RDBMS, advanced data models (extended-relational, object-oriented, etc.)
• 1990s:
• Data mining, data warehouses, multimedia databases, web databases
• 2000s and later:
• moving services to cloud, Big Data, Data Science (nothing fundamentally new)
Historical perspective in business
• Relational databases
• Different systems in a single company
• Accounting
• Sales
• Logistics
• HR
• Client relations
• ERP offers integration of some aspects, but not strategic analysis
• A new tool was required
OLTP = On-Line Transactional Processing
• Contains data oriented towards processes (e.g. invoices)
• The amount of data is limited (e.g. several GB)
• Contains only current data or limited historical data
• Works with a large number of simple queries
• Contains basic data (atomic values)
• All operations are allowed: adding, modifying, deleting data
OLAP = On-line Analytical Processing
(for pure data source: Data Warehouses)

• Contains data oriented towards topics (e.g. sales, inventory)


• The amount of data is unlimited (e.g. TBs or more)
• Contains current data and ALL historical data
• Works with very complex queries concerning a lot of data
• Contains basic data and aggregations
• Data is often added, very rarely modified, „never” deleted
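The difference in query style between OLTP and OLAP can be shown with the standard-library `sqlite3` module. This is only a sketch: the `invoices` table and its rows are hypothetical, and both query types run against the same small table here, whereas in practice the analytical queries would run on a separate data warehouse.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE invoices ("
    "id INTEGER PRIMARY KEY, customer TEXT, year INTEGER, amount REAL)"
)
conn.executemany(
    "INSERT INTO invoices (customer, year, amount) VALUES (?, ?, ?)",
    [("An", 2022, 100.0), ("An", 2023, 150.0), ("Binh", 2023, 50.0)],
)

# OLTP style: many simple operations, each touching a single row,
# including modification of existing data.
conn.execute("UPDATE invoices SET amount = ? WHERE id = ?", (120.0, 1))
one_invoice = conn.execute(
    "SELECT customer, amount FROM invoices WHERE id = ?", (1,)
).fetchone()

# OLAP style: one complex query aggregating over all historical data.
yearly_totals = conn.execute(
    "SELECT year, SUM(amount) FROM invoices GROUP BY year ORDER BY year"
).fetchall()
conn.close()

print(one_invoice)
print(yearly_totals)
```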
Data warehouse – simple definition
A data warehouse is a:
• Topic-oriented (subject-oriented)
• Integrated
• Chronological (time-variant)
• Constant (non-volatile)
data collection intended for decision support tasks.

• Note: it is not a type of database. Instead both are types of data collection.
Data warehouse and data separation
Data profiling
• Candidate keys
• Amount of missing data
• Distribution of data
• Unique values
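The four profiling measures above can be computed for a small table as follows. This is a minimal sketch: the `profile` function, its output keys, and the rule that `None` or an empty string counts as missing are assumptions made for the example.

```python
from collections import Counter


def profile(rows, columns):
    """Return basic profiling statistics for each column."""
    report = {}
    for col in columns:
        values = [row.get(col) for row in rows]
        present = [v for v in values if v is not None and v != ""]
        counts = Counter(present)
        report[col] = {
            # Amount of missing data.
            "missing": len(values) - len(present),
            # Number of unique values.
            "unique": len(counts),
            # Distribution: the most frequent values with their counts.
            "distribution": counts.most_common(3),
            # Candidate key: no missing values and no repeated values.
            "candidate_key": bool(rows)
            and len(present) == len(rows)
            and len(counts) == len(rows),
        }
    return report


# Hypothetical rows, invented for this example.
people = [
    {"id": 1, "city": "Hanoi", "age": 31},
    {"id": 2, "city": "Hue", "age": None},
    {"id": 3, "city": "Hanoi", "age": 25},
]
report = profile(people, ["id", "city", "age"])
for column, stats in report.items():
    print(column, stats)
```

A column that passes the `candidate_key` check on a sample is only a candidate; whether it is really a key depends on the meaning of the data, not just on the values seen so far.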
ETL
• In the most basic terms: download data from the source and load it into
the Data Warehouse
• Copying (duplicating) data between data collections (databases)
• Data is Extracted from an OLTP database, Transformed to fit the DW
schema and Loaded into the DW
• A copy of the source data may (or may not) be stored on the DW
hardware
• The theoretical aspects of design are more important than eventual
implementation.
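The three ETL steps can be sketched with two in-memory SQLite databases standing in for the OLTP source and the DW. Everything here is an assumption made for illustration: the table names `sales` and `fact_sales`, the columns, the sample rows and the chosen transformations (cleaning names, deriving the year, converting cents).

```python
import sqlite3


def extract(source):
    """Extract: read the rows from the operational (OLTP) database."""
    return source.execute(
        "SELECT id, customer, sold_on, amount_cents FROM sales"
    ).fetchall()


def transform(rows):
    """Transform: reshape the rows to fit the warehouse schema."""
    return [
        (sale_id, customer.strip().title(), int(sold_on[:4]), cents / 100)
        for sale_id, customer, sold_on, cents in rows
    ]


def load(warehouse, rows):
    """Load: insert the transformed rows into the warehouse table."""
    warehouse.executemany(
        "INSERT INTO fact_sales (sale_id, customer, year, amount) "
        "VALUES (?, ?, ?, ?)",
        rows,
    )
    warehouse.commit()


# Hypothetical OLTP source.
source = sqlite3.connect(":memory:")
source.execute(
    "CREATE TABLE sales ("
    "id INTEGER PRIMARY KEY, customer TEXT, sold_on TEXT, amount_cents INTEGER)"
)
source.executemany(
    "INSERT INTO sales VALUES (?, ?, ?, ?)",
    [(1, " an ", "2023-01-15", 1250), (2, "BINH", "2024-03-02", 990)],
)

# Hypothetical warehouse with its own schema.
warehouse = sqlite3.connect(":memory:")
warehouse.execute(
    "CREATE TABLE fact_sales ("
    "sale_id INTEGER, customer TEXT, year INTEGER, amount REAL)"
)

load(warehouse, transform(extract(source)))
loaded = warehouse.execute(
    "SELECT sale_id, customer, year, amount FROM fact_sales ORDER BY sale_id"
).fetchall()
print(loaded)
```

The source rows are not modified; the warehouse receives a transformed copy, which matches the description of ETL as duplicating data between data collections.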
ETL and ELT
ETL vs ELT
ETL
• Extract: duplicating the data into a temporary staging area
• Needs another server
• Transform: preparing the model and transforming data to the desired form (schema-on-write)
• Load
ELT
• Extract: preparing the data from the source in its original form (schema-on-read)
• Load: duplicating the raw data to the DW server (into the Data Lake)
• Transform: using methods that work with non-relational data or data in different formats and structures
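The schema-on-read idea behind ELT can be illustrated as follows: raw records are stored unchanged, and a structure is imposed only when they are read. This is a sketch under assumptions: the list `raw_lake`, the JSON records and the function `read_amounts` are invented for the example.

```python
import json

# "Load": raw records are stored exactly as they arrived from the
# source, even though their structure is not uniform.
raw_lake = [
    '{"id": 1, "customer": "An", "amount": "12.50"}',
    '{"id": 2, "customer": "Binh", "amount": 9.9, "note": "gift"}',
    '{"id": 3, "customer": "Chi"}',
]


def read_amounts(raw_records):
    """Transform at read time: the schema is applied only now."""
    result = {}
    for line in raw_records:
        record = json.loads(line)
        amount = record.get("amount")
        if amount is None:
            # Records without an amount are skipped for this query only;
            # they remain available in the lake for other questions.
            continue
        result[record["customer"]] = float(amount)
    return result


amounts = read_amounts(raw_lake)
print(amounts)
```

With schema-on-write (ETL), the mixed types and the missing field would have to be resolved before loading; here that decision is deferred to each reader.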
