Data Modelling in Google App Engine For Java
Slide 2
Google App Engine for Java: What and why?
• Google App Engine (GAE) is a cloud-computing platform that allows
users to build and deliver applications over the internet
• It was originally released in 2008 with support only for Python; Java
support followed in 2009 with Google App Engine for Java (GAEJ)
• Google charge depending on usage (storage, bandwidth, CPU)
• It provides a cheap way to deploy scalable web applications in Java
• The development environment is built in to Eclipse, and it is fast and
easy to develop applications
• As well as Google’s proprietary datastore there are other built-in
services (email, chat, queues, etc.)
Slide 3
Working with the datastore
• Google allows applications deployed on the GAE to persist data in a
data storage repository (“the datastore”)
– It’s the only storage choice available when using the GAE
• It’s an Object database built on Google’s proprietary BigTable
implementation
• To comply with existing standards and make things as painless as
possible for Java developers, Google App Engine for Java (GAEJ)
exposes this datastore via JPA and JDO
– In addition, a low-level API is provided
– Some parts of the JDO spec are not implemented
• There are big differences in the design and implementation choices that
can be made between the datastore and traditional relational databases
(RDBs)
– These differences are often obscured by the fact that developers use JDO/JPA, which
are more commonly associated with ORMs
Slide 4
Reader beware!
• Some of these guidelines mean that your model will not follow strict data modelling
design rules (either relational or object)
– We are both designers and developers and yes we do understand those rules ...
– We also understand that (some) designers can feel passionately about those rules ...
– However we have chosen to make things developer-friendly rather than designer-friendly.
– When designers get called out on support calls at 1 a.m. we will revise these guidelines to be more
designer-friendly
• The tips in the following sections are all rules of thumb
• Sometimes they make extra work in other areas but simplify dealing with the
datastore
• As always, design of real-world artefacts is a trade-off between time, cost, and
materials
– We like GAEJ because it is cheap and quick to develop on so have made design choices for our apps
that help us work around its differences from other platforms
• The tips are GAEJ-specific and we don’t necessarily recommend them for other data
stores
– We are working on the Eternal and Universally Applicable Data Storage Guide ... watch this space
Slide 5
How this presentation is organised
• Each of the sections contains some guidelines and rules of thumb that we
have found useful
• We use the term data modelling to cover object and relational modelling
• The sections run from high level architecture decisions down through design
choices to development choices. Sections are:
– Architecture: Comparison to RDBs and when to choose GAEJ
– Design: Tips for defining your persistence model
– Development: Understanding the persistence lifecycle, and miscellaneous tips
• Space is limited, so we only include snippets of code that are relevant
• Where we have found them useful, we include links to other sources. We’d
encourage you to read and understand them too.
• The documentation around the datastore is getting a lot better and that is
always a good starting point
• The DataNucleus JDO documentation is good and definitely worth reading.
Slide 6
Architecture: Relational compared to datastore
Relational database
• Relational databases are set-based
– Relations are intersections of sets
• Relational databases are “strongly-typed”
– The table defines the shape of the data (i.e. what the data must look like)
• Uses indexes and foreign keys to navigate
– Easier to navigate relationships
• Supports SQL as a means to access the underlying data
– Access to data independent of implementation
GAE datastore
• The datastore holds objects rather than rows
– Relations via associations
• The datastore defines Entities but these are not strongly typed
– Different versions of an Entity can have different attributes. Up to the app to deal with this.
• Uses object identity to navigate relationships
– Hold object ids as associations
• No support for SQL or other independent access
– Data access is via applications or GAEJ management console
Slide 7
Architecture: When to choose GAEJ?
GAEJ may be a good choice if ...
GAEJ is not a good choice if ...
Slide 8
Design: A bit about persistence
Slide 9
Design tips: Selecting persistent entities
• Before you begin annotating your object model, we suggest you divide the classes in your model into managed data and
reference data
• Managed data
– Things that your application looks after, often from birth to death. For example, in a survey application this might be the survey, the participant, the
questions, etc.
– Sometimes the data can be imported (e.g. getting a list of users from an LDAP directory and then adding the users’ qualifications to the data) but it is still managed
• Reference data
– Data that supports your applications. For example, country codes, airport codes, currency codes, etc.
Slide 10
Design: Dealing with relationships (1/3)
• Collections
– Ordered collections are supported, but GAEJ needs to create a separate index representing the position in the
collection.
– This means that manipulation of the collection can be potentially inefficient - especially if items are being added in
the middle of the collection as reindexing will need to occur.
– A good use for an ordered collection would be a set of question options in a survey (as the options are extremely
stable once defined).
– A bad use for an ordered collection would be a league table in an online game (better to store the entries in a different data
structure that holds the position and then order them in the client).
• Foreign key constraints
– Don't really exist in the sense of referential integrity. There is the concept of a Key object which can be used to
associate a child with a parent (and so feels like an FKEY) but there is no checking by the datastore to see whether
it exists.
• Owned (parent-child) relationships
– The two entities belong in the same EntityGroup.
– In order to perform an update within a transaction, the entities must belong to the same EntityGroup (as being in the
same entity group means that the data is physically co-located)
– Do this by setting the parent’s key in the child object (see Google's documentation)
– However, there is a limitation with this in that GAEJ can only perform 1-10 writes per second to an entity group. This
means that if the object(s) involved will receive more writes per second then the call will block.
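The key and entity-group points above can be sketched in plain Java. This is an in-memory illustration only, not the GAE API: SketchKey and SketchStore are hypothetical names, and the sketch assumes that a child's key contains its parent's key, that the root of that key path identifies the entity group, and that the store never checks whether the parent exists.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;

// Hypothetical stand-in for a datastore key: a kind, an id and an optional parent key.
final class SketchKey {
    final SketchKey parent; // null for a root entity
    final String kind;
    final long id;

    SketchKey(SketchKey parent, String kind, long id) {
        this.parent = parent;
        this.kind = kind;
        this.id = id;
    }

    // The root of the key path identifies the entity group.
    SketchKey root() {
        return parent == null ? this : parent.root();
    }

    boolean sameEntityGroup(SketchKey other) {
        return root().equals(other.root());
    }

    @Override public boolean equals(Object o) {
        if (!(o instanceof SketchKey)) return false;
        SketchKey k = (SketchKey) o;
        return id == k.id && kind.equals(k.kind) && Objects.equals(parent, k.parent);
    }

    @Override public int hashCode() {
        return Objects.hash(parent, kind, id);
    }

    @Override public String toString() {
        return (parent == null ? "" : parent + "/") + kind + "(" + id + ")";
    }
}

// Hypothetical in-memory store: it saves any key it is given and
// performs no referential-integrity check on the parent.
final class SketchStore {
    private final Map<SketchKey, String> entities = new HashMap<>();

    void put(SketchKey key, String value) { entities.put(key, value); }

    boolean exists(SketchKey key) { return entities.containsKey(key); }

    public static void main(String[] args) {
        SketchKey survey = new SketchKey(null, "Survey", 1);
        SketchKey question = new SketchKey(survey, "Question", 7);
        SketchStore store = new SketchStore();
        store.put(question, "Which colour?"); // the parent survey was never stored
        System.out.println(question + " in survey's group: " + question.sameEntityGroup(survey));
        System.out.println("parent stored: " + store.exists(survey));
    }
}
```

The child is accepted even though its parent was never saved, which is the sense in which a Key feels like a foreign key without being one.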
Slide 11
Design: Dealing with relationships (2/3)
• What kind of association is it – Composition or Aggregation?
• Composition
– The associated objects are created and destroyed along with the owning entity (e.g. a survey may have a set of questions associated with it. If
you delete the survey then the questions should be deleted too, assuming there isn’t the concept of a question bank).
– In this case think about modelling the object as an embedded object
– An embedded object has to be Serializable and is extracted/condensed with each storage. You don’t have to worry about navigating
associations as the objects are stored by value
– Make an object embedded by annotating the property with @Embedded. This is equivalent to @Persistent(serialized="true"). We
prefer the latter as it is not only clearer but also allows you to specify other attributes (more later)
– If you have 1000s of objects then serializing them may not be such a great idea, as it is much more of an overhead than looking them up.
• Aggregation
– A relationship exists between your entity and the other object but the associated object has some kind of independent existence (e.g.
Just because a football team folds it doesn’t mean that all the players have to be destroyed)
– How large is your associated object? Do the associated objects get updated in bulk or usually one at a time?
– It may be worth just storing the objectId of the associated object and then resolving the relationship yourself (using
PersistenceManager.getObjectById())
– If the objects can be updated independently of the key entity then you probably need to model them as independent persistent
entities. You definitely won’t want to model them as embedded objects in this case.
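A minimal sketch of the aggregation approach, using the football team example. It is plain Java with hypothetical class names; the map lookup stands in for the call to PersistenceManager.getObjectById() and is not the JDO API itself.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// A player exists independently of any team.
final class PlayerSketch {
    final long id;
    final String name;

    PlayerSketch(long id, String name) {
        this.id = id;
        this.name = name;
    }
}

final class TeamSketch {
    final String name;
    // Aggregation: hold the players' ids, not the player objects themselves.
    final List<Long> playerIds = new ArrayList<>();

    TeamSketch(String name) {
        this.name = name;
    }
}

// Hypothetical stand-in for the lookup that getObjectById() would perform.
final class PlayerLookupSketch {
    private final Map<Long, PlayerSketch> playersById = new HashMap<>();

    void save(PlayerSketch player) {
        playersById.put(player.id, player);
    }

    // Resolve the association ourselves. Ids with no matching player are skipped,
    // because nothing guarantees that a referenced object still exists.
    List<PlayerSketch> resolve(TeamSketch team) {
        List<PlayerSketch> result = new ArrayList<>();
        for (Long id : team.playerIds) {
            PlayerSketch player = playersById.get(id);
            if (player != null) {
                result.add(player);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        PlayerLookupSketch lookup = new PlayerLookupSketch();
        lookup.save(new PlayerSketch(1, "Ann"));
        TeamSketch team = new TeamSketch("Rovers");
        team.playerIds.add(1L);
        System.out.println(lookup.resolve(team).size() + " player(s) resolved for " + team.name);
    }
}
```

Discarding the team object leaves the players untouched, which is the behaviour aggregation calls for.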
Slide 12
Design: Dealing with relationships (3/3)
• Can the attributes of the association be denormalised?
– What data do I really need when modelling the association?
• For an address, is the house number and postcode sufficient?
– What data actually changes in the association?
• In a survey participant is it just the answers and the status?
– If so, perhaps these data can be stored in the entity the other side of the relationship
– Updates become a bit more complex as you need to keep the denormalised attributes in step
– If there is a lot of updating happening then denormalising isn’t the right choice
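As a sketch of what keeping denormalised attributes in step can look like, the plain Java below copies a postcode from an address on to the customer that refers to it. The class and method names are hypothetical and no persistence annotations are shown; the point is only that every update goes through one method that changes both copies.

```java
// The full address, which in a real model would be a separate persistent entity.
final class AddressSketch {
    private String houseNumber;
    private String postcode;

    AddressSketch(String houseNumber, String postcode) {
        this.houseNumber = houseNumber;
        this.postcode = postcode;
    }

    void update(String houseNumber, String postcode) {
        this.houseNumber = houseNumber;
        this.postcode = postcode;
    }

    String getHouseNumber() { return houseNumber; }

    String getPostcode() { return postcode; }
}

final class CustomerSketch {
    private final AddressSketch address;
    // Denormalised copy of the one address attribute we filter on.
    private String postcode;

    CustomerSketch(AddressSketch address) {
        this.address = address;
        this.postcode = address.getPostcode();
    }

    // Single update path: the address and the denormalised copy change together.
    void moveTo(String houseNumber, String newPostcode) {
        address.update(houseNumber, newPostcode);
        this.postcode = newPostcode;
    }

    // Read-only to callers: there is deliberately no setter for the copy.
    String getPostcode() { return postcode; }

    AddressSketch getAddress() { return address; }

    public static void main(String[] args) {
        CustomerSketch customer = new CustomerSketch(new AddressSketch("10", "W1A 1AA"));
        customer.moveTo("5", "EC1A 1BB");
        System.out.println("customer postcode is now " + customer.getPostcode());
    }
}
```

The extra work in moveTo() is the cost mentioned above; if addresses change often, that cost is paid on every update.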
Slide 14
Development: The persistence lifecycle (2/2)
• Detaching
– You can manually detach
• Call PersistenceManager.detachCopy(object) to detach a copy of the object or
detachCopyAll(collection) to detach persistent collections
• This can be a bit messy and error-prone (if you forget to detach)
– You can detach following a successful transaction (our preference)
• By calling PersistenceManager.setDetachAllOnCommit(true)
• This can also be set in the JDO configuration file but we prefer to make it explicit in the code
• Detaching @Embedded properties can be quirky
– If you have an embedded collection then it is not detached even when setting
detachAllOnCommit (you get an error when trying to access the collection)
– You can work around this by “touching” the object whilst in the transaction
• This effectively fetches the object so that when the detach is called, the collection is present
– A better way is to change the persistence declaration of the object to keep it as
embedded but ensure that the collection is in the same fetch group as its object
• Do this through @Persistent(serialized="true", defaultFetchGroup="true")
• If you have named fetch groups you will need to change this
Slide 15
Development tips: Working with queries
• Queries in GAEJ look similar to queries in other ORMs, but there are differences
• The entity you are querying on has to have an index
• There are restrictions on some operators, e.g. <>
• You can’t query on Blob and Text fields
• Sometimes what appears to be a single query will in fact result in two or more queries
• The differences are subtle and non-trivial so we recommend reading the
Google documentation
• If you are going to perform a “join-query” – i.e. filtering on an attribute held in a Collection (e.g.
find all customers with postcode beginning “W1”) then it is probably worth denormalising (e.g.
holding the postcode as an attribute of the Customer)
– You can always make it read-only to callers of the object attribute through class design
• We try to keep our use of queries as simple as possible and only use them for list behaviours where we
return a collection of objects of the same type
– We know that this isn’t always possible, but it is worth thinking of alternatives to other queries (e.g. splitting into multiple
queries, resolving in the middle layer) just to make life easier
– Again, this will depend on how many objects you have, how often the query will be run, how often the data changes
– These are all things that you need to consider before writing your data access code
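To illustrate how one apparent query can turn into two, here is a plain Java sketch of a "not equal" filter resolved as a "less than" scan plus a "greater than" scan over a sorted index, with the results merged. Our understanding is that the datastore handles the restricted <> operator in a similar way, but the StatusIndexSketch class, its TreeMap index and its method names are all assumptions made for illustration, not the GAE API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sorted index on one property: property value -> ids of matching entities.
final class StatusIndexSketch {
    private final TreeMap<String, List<Long>> index = new TreeMap<>();

    void add(String status, long entityId) {
        index.computeIfAbsent(status, k -> new ArrayList<>()).add(entityId);
    }

    // Range scan: ids whose value sorts strictly before the bound.
    List<Long> lessThan(String bound) {
        return flatten(index.headMap(bound, false));
    }

    // Range scan: ids whose value sorts strictly after the bound.
    List<Long> greaterThan(String bound) {
        return flatten(index.tailMap(bound, false));
    }

    // "Not equal" resolved as two range scans whose results are merged.
    List<Long> notEqual(String value) {
        List<Long> result = lessThan(value);
        result.addAll(greaterThan(value));
        return result;
    }

    private static List<Long> flatten(Map<String, List<Long>> slice) {
        List<Long> ids = new ArrayList<>();
        for (List<Long> bucket : slice.values()) {
            ids.addAll(bucket);
        }
        return ids;
    }

    public static void main(String[] args) {
        StatusIndexSketch surveys = new StatusIndexSketch();
        surveys.add("ACTIVE", 1);
        surveys.add("CLOSED", 2);
        surveys.add("DRAFT", 3);
        System.out.println("not CLOSED: " + surveys.notEqual("CLOSED"));
    }
}
```

The same split-and-merge step can be done in the middle layer, which is one of the alternatives to a complex query mentioned above.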
Slide 16
Development tips: Miscellaneous tips
• GAEJ provides access to a memcached cache
• We try to use the cache wherever possible
– The cache built in to GAEJ is easy to use and follows the JSR-107 spec
– There is also a low-level API although we have never had occasion to use it
– The cache is significantly faster than using the datastore
– The Google documentation tells you all you need to know about the API
• More importantly, you must understand your objects’ lifecycle before caching
– When will the object in the cache need to be refreshed?
– When should objects be evicted from the cache?
– This exercise is no different in GAEJ than designing caching in other applications
• Rule of thumb #1: We “always” cache reference data
– There will no doubt be an occasion when we don’t cache some reference data
• Rule of thumb #2: We cache objects that are frequently accessed
– For example, an online survey that is currently in progress
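The caching rules of thumb can be sketched as a read-through lookup for reference data. This is plain Java under stated assumptions: a java.util.Map stands in for the GAEJ cache, a Function stands in for the datastore read, and ReferenceDataCacheSketch with its methods is a hypothetical name, not part of any GAE or JSR-107 API.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

final class ReferenceDataCacheSketch {
    private final Map<String, String> cache;                 // stand-in for the cache service
    private final Function<String, String> datastoreLookup;  // stand-in for a datastore read
    int datastoreReads = 0;                                   // counts cache misses, for illustration

    ReferenceDataCacheSketch(Map<String, String> cache, Function<String, String> datastoreLookup) {
        this.cache = cache;
        this.datastoreLookup = datastoreLookup;
    }

    // Try the cache first; only on a miss read the datastore and populate the cache.
    String countryName(String code) {
        String cached = cache.get(code);
        if (cached != null) {
            return cached;
        }
        datastoreReads++;
        String loaded = datastoreLookup.apply(code);
        if (loaded != null) {
            cache.put(code, loaded);
        }
        return loaded;
    }

    // Evict when the underlying reference data changes, so the next read refreshes it.
    void evict(String code) {
        cache.remove(code);
    }

    public static void main(String[] args) {
        ReferenceDataCacheSketch countries = new ReferenceDataCacheSketch(
                new HashMap<>(), code -> "GB".equals(code) ? "United Kingdom" : null);
        countries.countryName("GB");
        countries.countryName("GB");
        System.out.println("datastore reads after two lookups: " + countries.datastoreReads);
    }
}
```

Deciding when to call evict() is the lifecycle question raised above; the sketch only shows where that decision plugs in.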
Slide 17
We’d like to hear from you
• We have several blog posts that delve into more detail on GAEJ.
Feel free to comment on them and share your experiences
– True North blog
• Did you know that Google have an App Engine blog oriented to
developers?
• We’re always happy to learn from your experiences with the
datastore or with GAEJ:
– Contact us via Twitter
– Or email us
Slide 18