Chapter 1
Source: https://www.datasciencegraduateprograms.com/data-modeling/
Data Modeling (2)
Data modeling is a crucial skill for every data scientist,
whether you are doing research design or architecting
a new data store for your company
Data modeling for data science requires the ability to
think clearly and systematically about
the key data points to be stored and retrieved, and
how they should be grouped and related
Data modeling is sometimes as much art as science
Following the rules of normalization can be straightforward,
but knowing when to break them and what data to optimize
for later access takes perception beyond simply applying rules
Data Modeling (3)
Stages of Data Modeling (i.e., creating schemas):
Conceptual – imposes a theoretical order on data as it
exists in relation to the entities being described, i.e., the
real-world artifacts or concepts
Logical – imposes order by establishing discrete entities,
key values, and relationships in a logical structure
Physical – breaks the data down into the actual tables,
clusters, and indexes required for the data store
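As a concrete illustration, here is a minimal sketch in Python
using SQLite, assuming a hypothetical customer/orders domain:
the logical stage's entities, keys, and one-to-many relationship
become physical tables, and the physical stage adds an index
chosen for the expected access path.

    import sqlite3

    # Logical model: customer and order entities, related 1-to-many.
    # Physical model: concrete tables plus an index for the access path.
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
    CREATE TABLE customer (
        customer_id INTEGER PRIMARY KEY,   -- key value for the entity
        name        TEXT NOT NULL
    );
    CREATE TABLE orders (
        order_id    INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL REFERENCES customer(customer_id),
        placed_at   TEXT NOT NULL,
        total_cents INTEGER NOT NULL
    );
    -- Physical-stage decision: index the foreign key because we
    -- expect to fetch orders by customer.
    CREATE INDEX idx_orders_customer ON orders(customer_id);
    """)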
Visual representations of data models: entity-
relationship model/diagram, Bachman diagram,
object-role modeling, Zachman framework
RDBMS
The RDBMS store paradigm relies on the database
system to maintain internal consistency and
coherence of the data being held in it
Very large datasets have thrown something of a
wrench into the dominance of RDBMS, whether the
data being stored can easily be modeled relationally
or not
When millions or trillions of data points are being
stored, the price of this internal consistency can
bring performance grinding to a halt
NoSQL (1)
NoSQL databases such as MongoDB, Cassandra, and
HBase have been one of the most promising industry
answers to this problem
These sometimes use radically denormalized data stores
with the sole objective of improving performance
They rely on the calling code and queries to handle
consistency, integrity, and concurrency, offering blinding
speed and scalability over ease of use
They adopt simpler data models, such as:
Key-value stores
Document stores
Graph stores
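A minimal sketch in Python, with hypothetical data, of how the
same user record might look under each of these models:

    # Key-value store: an opaque value looked up by key.
    kv_store = {}
    kv_store["user:42"] = '{"name": "Ada", "followers": 12911}'

    # Document store: a self-contained, denormalized document;
    # related data is embedded rather than joined at read time.
    doc_store = {
        "user:42": {
            "name": "Ada",
            "followers": 12911,
            "recent_tweets": [
                {"id": 1, "text": "hello"},
                {"id": 2, "text": "world"},
            ],
        }
    }

    # Graph model: entities as nodes, relationships as edges.
    edges = [("user:42", "FOLLOWS", "user:7"),
             ("user:7", "FOLLOWS", "user:42")]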
NoSQL (2)
Modeling these types of stores is a significant
departure from the RDBMS method
Data scientists may start from the result side of the
process, asking themselves, “What question am I trying to
answer?” instead of “What data do I need to store?”
They may deliberately accept duplication of data, and
have to plan to handle concurrency conflicts and other
integrity issues on the output end rather than in the design
itself
They might choose to aggregate data rather than breaking
it down discretely
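For instance, a minimal sketch (hypothetical structures) of such a
query-first, aggregated design, where a home timeline is stored in
exactly the shape the read query needs:

    # The question "what does this user's home timeline look like?"
    # is answered by a single lookup, at the cost of duplicating
    # author names into every entry.
    home_timelines = {
        "user:42": [
            {"tweet_id": 9001, "author": "Ada",  "text": "hello"},
            {"tweet_id": 9002, "author": "Alan", "text": "world"},
        ]
    }

    def read_timeline(user_id):
        # No joins: the aggregate is stored pre-assembled.
        return home_timelines.get(user_id, [])

    print(read_timeline("user:42"))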
NoSQL (3)
NoSQL data modeling uses advanced techniques:
atomic updates, dimensionality reduction, inverted
search patterns, tree aggregation
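As one illustration, a minimal sketch of an inverted search pattern
over hypothetical documents, mapping each term to the documents
containing it so that search becomes a lookup rather than a scan:

    from collections import defaultdict

    # Hypothetical corpus: document id -> text.
    docs = {1: "grace hopper", 2: "ada lovelace", 3: "grace kelly"}

    # Inverted index: term -> set of document ids containing it.
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.split():
            index[term].add(doc_id)

    print(sorted(index["grace"]))  # -> [1, 3]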
Understanding these techniques, and the capabilities
offered by NoSQL, allows data scientists to make the
best choice of which type of data store to use and
how to model it
In almost every case, data scientists in the real world
will end up using a combination of RDBMS and
NoSQL or other exotic data sources as a daily part of
their work
Data Modeling for Data Science
Foundations of Data Systems
Lecture 01
The first version of Twitter looked up tweets at read time, querying a global
tweet table and the follow relationships for every home timeline request; it
struggled to keep up with the load of these queries, so the company switched
to a second version that fans each new tweet out to a cached home timeline
for every follower at write time.
The second version works better because the average rate of published tweets
is almost two orders of magnitude lower than the rate of home timeline reads,
and so in this case it’s preferable to do more work at write time and less at
read time. The downside is that posting a tweet now requires a lot of extra
work: some users have over 30 million followers, so a single tweet may result
in over 30 million writes to home timelines. Doing this in a timely manner –
in 5 secs (Twitter’s target) – is a significant challenge.
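A minimal sketch (hypothetical structures) of this fan-out-on-write
approach, where posting does one write per follower so that reading
a timeline is a single cheap lookup:

    followers = {"ada": ["alan", "grace"]}   # who follows whom
    home_timelines = {"alan": [], "grace": []}

    def post_tweet(author, text):
        tweet = {"author": author, "text": text}
        # Write-time work grows with the author's follower count.
        for follower in followers.get(author, []):
            home_timelines[follower].insert(0, tweet)

    def read_timeline(user):
        # Read-time work is constant: the timeline is precomputed.
        return home_timelines[user]

    post_tweet("ada", "hello")
    print(read_timeline("grace"))  # [{'author': 'ada', 'text': 'hello'}]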
Describing Load (6)
In the example of Twitter, the distribution of followers
per user (maybe weighted by how often those users
tweet) is a key load parameter for discussing
scalability, since it determines the fan-out load
The final twist: Twitter is moving to a hybrid of both
approaches
Most users’ tweets continue to be fanned out to home
timelines at the time when they are posted
A small number of users with a very large number of
followers (i.e., celebrities) are excepted; tweets from any
celebrities that a user may follow are fetched separately
and merged with that user’s home timeline when it is read
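A minimal sketch (hypothetical structures) of this hybrid read path,
merging the fanned-out timeline with celebrity tweets fetched at
read time:

    celebrities = {"taylor"}
    celebrity_tweets = {"taylor": [{"author": "taylor", "text": "hi"}]}
    home_timelines = {"alan": [{"author": "ada", "text": "hello"}]}
    follows = {"alan": ["ada", "taylor"]}

    def read_timeline(user):
        timeline = list(home_timelines.get(user, []))
        # Read-time merge only for the few very-high-fan-out accounts.
        for followee in follows[user]:
            if followee in celebrities:
                timeline.extend(celebrity_tweets.get(followee, []))
        return timeline

    print(read_timeline("alan"))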
Describing Performance (1)
Once you have described the load on your system, you
can investigate what happens when the load increases:
When you increase a load parameter and keep the system
resources (CPU, memory, network bandwidth, etc.)
unchanged, how is the performance of your system affected?
When you increase a load parameter, how much do you need
to increase the resources if you want to keep performance
unchanged?
In a batch processing system such as Hadoop, we
usually care about throughput – the number of records
we can process per second, or the total time it takes to
run a job on a dataset of a certain size
Describing Performance (2)
In online systems, what’s usually more important is
the service’s response time – the time between a
client sending a request and receiving a response
In practice, in a system handling a variety of requests,
the response time can vary a lot (we need to think of
it as a distribution of values that you can measure)
Most requests are reasonably fast, but there are occasional
outliers that take much longer
In order to figure out how bad your outliers are, you can
look at higher percentiles: 95th, 99th, and 99.9th percentiles
are common (abbreviated p95, p99, and p999), e.g., if the
95th percentile response time is 1.5 secs, that means 5 out
of 100 requests take 1.5 secs or more
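A minimal sketch, using synthetic response times, of summarizing
such a distribution with percentiles via NumPy:

    import numpy as np

    # Synthetic data: mostly fast requests plus occasional slow outliers.
    rng = np.random.default_rng(0)
    response_times_ms = np.concatenate([
        rng.normal(100, 20, 9_900),     # typical requests
        rng.normal(1_500, 300, 100),    # outliers
    ])

    # A single average would hide the tail; percentiles expose it.
    for p in (50, 95, 99, 99.9):
        print(f"p{p}: {np.percentile(response_times_ms, p):.0f} ms")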
Describing Performance (3)
In practice, … (cont’d)
High percentiles of response times, a.k.a. tail latencies, are
important because they directly affect users’ experience of
the service, e.g., the customers with the slowest requests
are often the most valuable customers
On the other hand, reducing response times at very high
percentiles (e.g., 99.99th) is difficult because they are easily
affected by random events outside of your control, and the
benefits are diminishing
Percentiles are often used in service level objectives
(SLOs) and service level agreements (SLAs), contracts
that define the expected performance and availability
of a service
Describing Performance (4)
For example, an SLA may state:
The service is considered to be up if it has a median
response time < 200 ms and a 99th percentile < 1 s
It may be required to be up at least 99.9% of the time
These metrics set expectations for the clients of the service
and allow customers to demand a refund if the SLA isn’t met
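A minimal sketch of checking a window of measured response times
against the thresholds in this example SLA (the thresholds are the
only inputs taken from the text above):

    import numpy as np

    def sla_met(response_times_ms):
        median = np.percentile(response_times_ms, 50)
        p99 = np.percentile(response_times_ms, 99)
        return median < 200 and p99 < 1_000

    print(sla_met([120, 150, 180, 90, 700]))  # True for this tiny sample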
Queueing delays often account for a large part of the
response time at high percentiles
As a server can only process a small number of things in
parallel, it only takes a small number of slow requests to
hold up the processing of subsequent requests (i.e., head-of-
line blocking)
Describing Performance (5)
Queueing delays often … (cont’d)
Even if those subsequent requests are fast to process on
the server, the client will see a slow overall response time
due to the time waiting for the prior request to complete
When generating load artificially to test the scalability of a
system, the load-generating client needs to keep sending
requests independently of the response time, rather than
waiting for the previous request to complete before sending
the next one (waiting would artificially keep the queues
shorter in the test than they would be in reality, which
skews the measurements)
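A minimal sketch of such an open-loop load generator using Python’s
asyncio, with a stand-in coroutine in place of a real request:

    import asyncio, time

    async def fake_request():
        await asyncio.sleep(0.05)   # stand-in for a real network call
        return time.monotonic()

    async def generate_load(rate_per_sec=100, duration_sec=2):
        interval = 1.0 / rate_per_sec
        tasks = []
        start = time.monotonic()
        while time.monotonic() - start < duration_sec:
            # Issue the request without awaiting it, so the next one
            # goes out on schedule regardless of response time.
            tasks.append(asyncio.create_task(fake_request()))
            await asyncio.sleep(interval)
        await asyncio.gather(*tasks)

    asyncio.run(generate_load())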