8.2. CQL Exercises


Throughout the course we will work with 2 sets of data:

Movielens data set (real data): 6k users, 4k movies, 1 million ratings - we copy only 10k entries
Weather data (randomly generated): 500k entries

Movielens data details:

Movies.csv extract

mid  title                                genres                          year
1    Toy Story (1995)                     {Animation,Children's,Comedy}   1995
2    Jumanji (1995)                       {Adventure,Children's,Fantasy}  1995
3    Grumpier Old Men (1995)              {Comedy,Romance}                1995
4    Waiting to Exhale (1995)             {Comedy,Drama}                  1995
5    Father of the Bride Part II (1995)   {Comedy}                        1995

Users.csv extract

uid  gender  age  occupation  zip
1    F       1    10          48067
2    M       56   16          70072
3    M       25   15          55117
4    M       45   7           0246

Ratings2.csv extract

ruid  rmid  rating  timestamp
1     1193  5       978300760
1     661   3       978302109
1     914   3       978301968
1     3408  4       978300275
1     2355  5       978824291

Denormalized data - Ratings.csv extract

uid   age  gender  occupation  zip    rating  datetime             mid  title                   year  genres
5234  25   F       0           60657  1       2000-06-19 20:40:40  31   Dangerous Minds (1995)  1995  {Drama}
1835  25   M       19          11501  3       2000-11-22 08:11:31  31   Dangerous Minds (1995)  1995  {Drama}
6035  25   F       1           78734  2       2000-04-26 01:25:35  31   Dangerous Minds (1995)  1995  {Drama}
4637  35   F       9           48302  3       2000-07-19 18:48:41  31   Dangerous Minds (1995)  1995  {Drama}
3841  45   M       18          26101  3       2000-08-11 13:14:31  31   Dangerous Minds (1995)  1995  {Drama}
2242  18   M       4           60115  5       2000-11-19 02:40:43  31   Dangerous Minds (1995)  1995  {Drama}

weather-data.csv extract

 w_id       | event_time                      | temperature
------------+---------------------------------+-------------
 PNOSCPQCNX | 2015-01-01 00:02:54.000000+0000 |          20
 PNOSCPQCNX | 2015-01-01 00:10:00.000000+0000 |          23
 PNOSCPQCNX | 2015-01-01 00:29:53.000000+0000 |          20
 PNOSCPQCNX | 2015-01-01 01:04:22.000000+0000 |          12
 PNOSCPQCNX | 2015-01-01 01:17:15.000000+0000 |          18
 PNOSCPQCNX | 2015-01-01 02:52:25.000000+0000 |          23
 PNOSCPQCNX | 2015-01-01 03:09:17.000000+0000 |          11
 PNOSCPQCNX | 2015-01-01 03:14:57.000000+0000 |          13
 PNOSCPQCNX | 2015-01-01 04:13:33.000000+0000 |          14
 PNOSCPQCNX | 2015-01-01 04:47:56.000000+0000 |          13

All the above files can be found in the data directory
(both cloud and local installations).

cd /data
ls -l

Note: for the intro part we play a bit with different
data types and insert data manually, before we
work with the above data.

CQL exercises
Reference for syntax questions:
http://docs.datastax.com/en/cql/3.3/cql/cql_using/useAboutCQL.html

Run cqlsh

1. Create a Keyspace - for local installations use
metro_systems, but for cloud installations use your name
(e.g. my keyspace is valentina_c) -- since all cloud users need
to create a separate working space. You will use that name
instead of metro_systems in the commands below.

CREATE KEYSPACE metro_systems WITH REPLICATION = { 'class' :
'SimpleStrategy', 'replication_factor' : 1 };

Desc keyspaces   // will list all keyspaces on the cluster

describe metro_systems; // you can see the replication factor & strategy
set at keyspace level

Exit cqlsh and run the nodetool status command - this should give you an image of your cluster
(even if it's a 1 node cluster). Note the datacenter name.

Re-enter cqlsh

Let's create another keyspace, with the NetworkTopologyStrategy replication strategy and
replication factor 1.

CREATE KEYSPACE metro_systems1 WITH REPLICATION = { 'class' :
'NetworkTopologyStrategy', 'datacenter1' : 1 };

describe metro_systems1;

Let's alter a created keyspace and change the durable_writes property:

ALTER KEYSPACE metro_systems1 WITH REPLICATION = { 'class' :
'NetworkTopologyStrategy', 'datacenter1' : 1 } AND DURABLE_WRITES =
false;

OR if we want to modify the replication factor from 1 to 3:

ALTER KEYSPACE metro_systems1 WITH REPLICATION = { 'class' :
'NetworkTopologyStrategy', 'datacenter1' : 3 };
ALTER KEYSPACE valentina_c_1 WITH REPLICATION = { 'class' :
'NetworkTopologyStrategy', 'datacenter1' : 3 }; // the same alter on a cloud keyspace - use your own keyspace name

Desc metro_systems1

Let's drop keyspace metro_systems1:

DROP KEYSPACE metro_systems1;

Stop: let's explain what the durable_writes option in the keyspace description means.

2. Let's create and modify a column family

First, connect to the created keyspace: i.e. all further CQL commands will be executed in the
context of the chosen keyspace.

USE metro_systems;

2.1. Create table users with the next structure: userid (uuid), name (text), gender (text),
age (int), occupation (text), zipcode (int) and a compound primary key: partition key userid
and clustering key name.

CREATE TABLE users (
    userid uuid,
    name text,
    gender text,
    age int,
    occupation text,
    zipcode int,
    PRIMARY KEY ((userid), name));

DESCRIBE users; // you can see all properties of the table

2.2. ALTER: Add/Delete/Rename columns, Alter column type

ALTER TABLE users ADD relatives int;   // add a new column to the table

SELECT * FROM users;

ALTER TABLE users RENAME userid TO uid; // rename column

ALTER TABLE users ALTER relatives TYPE blob;  // alter column type from int to blob

Describe users; // to see the types of the columns

Please note: when altering column types, some changes will not be possible, for example:
varchar to ascii.
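
For instance, a minimal sketch of a change that should be rejected (occupation is a regular
text, i.e. varchar, column, and ascii is a narrowing conversion):

ALTER TABLE users ALTER occupation TYPE ascii; // expected to fail: text cannot be narrowed to ascii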

ALTER TABLE users DROP relatives;   // delete the newly added column from the table

SELECT * FROM users;

Describe users;

DROP TABLE users; // deletes all the data of the table including the table structure &
indexes

Desc tables;  // no tables

2.3. Insert values in the created table

Assuming you dropped the table users previously, let's re-create it.

CREATE TABLE users (
    uid uuid,
    name text,
    gender text,
    age int,
    occupation text,
    zipcode int,
    PRIMARY KEY ((uid), name));

We have a table, now let's insert data manually. We will also learn later how to insert data
from a file, but for the moment let's do some manual inserts/updates.

Observation: the UUID can be generated automatically with uuid();

INSERT INTO users (uid, name, age, gender, occupation, zipcode) VALUES
(5132b130-ae79-11e4-ab27-0800200c9a66, 'Andrew', 25, 'M', 'plumber',
11501);

INSERT INTO users (uid, name, age, gender, occupation, zipcode) VALUES
(uuid(), 'Dana', 24, 'F', 'data scientist', 11501);

SELECT * FROM users;

TRACING ON; // allows you to see the events Cassandra runs in the background.

INSERT INTO users (uid, name, age, gender, occupation, zipcode) VALUES
(5132b130-ae79-11e4-ab27-0800200c9a66, 'Lili', 24, 'F', 'actress', 11503);
// note that I am inserting data in the same partition as the first
insert. Given that the clustering key is different, the data will be added
to the existing partition.

TRACING OFF

SELECT * FROM users; // remember the way Cassandra stores data is
different than the CQL view.

Remember for the next section: Cassandra doesn't read (by default) before
performing writes.
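
A minimal sketch of what that implies, reusing Lili's row from above: re-inserting the same
partition key and clustering key silently overwrites the existing row - no error, no
read-before-write.

INSERT INTO users (uid, name, age, gender, occupation, zipcode) VALUES
(5132b130-ae79-11e4-ab27-0800200c9a66, 'Lili', 30, 'F', 'actress', 11503);
// same primary key (uid + name) as before, so the row is overwritten and age becomes 30

SELECT * FROM users WHERE uid = 5132b130-ae79-11e4-ab27-0800200c9a66;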

2.4. Let's also do some upserts

UPDATE users SET age = 35 WHERE uid = 5132b130-ae79-11e4-ab27-0800200c9a66 and
name = 'Andrew'; // we update the age of an existing user

Select * from users; // to see the change

UPDATE users SET age = 35 WHERE uid = 5132b130-ae79-11e4-ab27-0800200c9a77 and
name = 'Andrew'; // the UUID - partition key - does not exist, thus the update
will do an insert

SELECT * from users; // let's check that everything went well

UPDATE users SET age = 44 WHERE uid = 5132b130-ae79-11e4-ab27-0800200c9a66 and
name = 'Elaine'; // the partition key exists but not with this clustering
key, so a new entry inside an existing partition will be generated

Select * from users;

Select * from users where uid = 5132b130-ae79-11e4-ab27-0800200c9a66; // to
see all entries in this partition

Let's do some more inserts. The first row inserted will have a new partition and clustering key. The
second row will have the same partition key but a different clustering key.

insert into users (uid, name, age, gender, occupation, zipcode) values
(uuid(),'John',35,'M','data analyst',11502); // see the uuid generated here
and use it for the next insert

insert into users (uid, name, age, gender, occupation, zipcode) values
(90278d0f-6053-4c87-a1f3-69a05853d2a8,'Elaine',34,'F','engineer',11502);
Select * from users;

Insert into users (uid, name, age) values
(90278d0f-6053-4c87-a1f3-69a05853d2a8, 'Elaine', 43); // see how insert can
be used as an update as well

Before we move on, type exit in cqlsh and let's see some stats for our table.

nodetool cfstats keyspace.table // generic command, see below for the metro_systems keyspace

nodetool cfstats metro_systems.users  // for cloud users use yourname.users

See that the data is still in the memtable, no SSTables generated; see also the estimated
number of partitions (Number of keys (estimate)).

2.5. Lightweight transactions (LWT)

Re-enter cqlsh and your respective keyspace

Use metro_systems;

Let's try an update on another non-existing entry, but with LWT:

UPDATE users SET age = 35 WHERE uid = 5132b130-ae79-11e4-ab27-0800200c9a88
and name = 'Andrew' IF EXISTS;

Now let's try an update on an existing entry with LWT:

UPDATE users SET age = 44 WHERE uid = 5132b130-ae79-11e4-ab27-0800200c9a77
and name = 'Andrew' IF EXISTS;
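
LWT works on inserts too: IF NOT EXISTS only applies the write when no row with that primary
key exists, and returns an [applied] flag. A small sketch - the user 'Maria' and her details
are made up for illustration:

INSERT INTO users (uid, name, age, gender, occupation, zipcode) VALUES
(uuid(), 'Maria', 30, 'F', 'teacher', 11504) IF NOT EXISTS;
// [applied] = True here, since uuid() generates a fresh partition key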

Exit CQL

Let's force a flush to write the data on disk in an SSTable. Run

nodetool flush metro_systems users // for cloud users please make sure you use
your own keyspace

Go to cd /var/lib/data/cassandra/. Choose your keyspace directory and the directory
of the table. Use ls -l to list the files generated after the flush.

If you would like to see what an SSTable looks like you can use: sstabledump
nameofthedatafile // e.g. sstabledump mc-1-big-Data.db

sstabledump -e nameofthedatafile // will show us the partitions in the specific SSTable

STOP here for theory update. Queries section.

3. Let's do some queries on the data - we will come back to
queries, this is just to get us going with the syntax

cd /

Re-enter CQL with the cqlsh command and use your keyspace;

SELECT * FROM users where uid = 5132b130-ae79-11e4-ab27-0800200c9a66;

// You can also pick the columns to display instead of choosing all data.

SELECT age, gender FROM users where uid =
5132b130-ae79-11e4-ab27-0800200c9a66;

SELECT * FROM users LIMIT 3; // you can limit the number of returned
values

// You can fine-tune the display order using the ORDER BY clause, either descending
or ascending. The partition key must be defined in the WHERE clause and the ORDER
BY clause defines the clustering column to use for ordering. ORDER BY works only on
clustering keys.

SELECT * FROM users where uid = 5132b130-ae79-11e4-ab27-0800200c9a66 ORDER
BY name DESC;

// You can query for a partition key that is part of a list of partition keys

SELECT * FROM users where uid IN (5132b130-ae79-11e4-ab27-0800200c9a66,
edb032e8-5c4e-403f-8dfc-43c2dcd69dba);

You can return the number of records that fulfill a request:

SELECT COUNT(*) FROM users where uid =
5132b130-ae79-11e4-ab27-0800200c9a66;

You can ask for the distinct partitions in a table (DISTINCT works only on partition
keys):

select distinct uid from users;

And also count how many distinct partitions exist:

select distinct uid, count(*) from users;

3.1. CREATE AN INDEX

SELECT * FROM users WHERE gender = 'M'; // WILL NOT WORK

// let's create an index

CREATE INDEX users_by_gender ON users(gender);

SELECT * FROM users WHERE gender = 'F';

// let's count how many records are in the query outcome

SELECT count(*) FROM users WHERE gender = 'F';

Tracing on;

select * from users where gender = 'F';  // to see all the operations C* is doing in the
background (even for a 1 node cluster you can see loads of operations).

CAN YOU SEE THAT THERE ARE 2 TABLES QUERIED ON THE NODE IN ORDER TO GET
THE DATA: the index table and the users table
 Executing single-partition query on users.users_by_gender
 Executing single-partition query on users -- appears 3 times. Can you tell me why?
We will come back to this point in the section Indexing in C*.

4. Collections

SETS

ALTER TABLE users ADD emails set<text>;

Let's insert some data in the set:

insert into users (uid,name,age,emails,gender,occupation,zipcode)
values (uuid(),'Lilian',18,{'[email protected]','[email protected]'},'F','student',11503);

Select * from users;

Let's add some data to existing entries:

UPDATE users SET emails = emails + {'[email protected]'} WHERE uid =
89d3a708-0571-440a-8a9b-997856c17789 AND name = 'Elaine';

Let's delete some data from existing entries:

UPDATE users SET emails = emails - {'[email protected]'} WHERE uid =
89d3a708-0571-440a-8a9b-997856c17789 AND name = 'Elaine';

SELECT * FROM USERS;

create index on users(emails);

select * from users where emails CONTAINS '[email protected]';

LISTS

LET'S ADD A LIST COLUMN TO TABLE USERS:

ALTER TABLE users ADD top_movies list<text>;

LET'S ADD SOME DATA TO EXISTING ENTRIES:

update users set top_movies = ['titanic','gatsby'] where uid =
89d3a708-0571-440a-8a9b-997856c17789 and name = 'Elaine';

Select * from users;

update users set top_movies = ['romeo and juliet'] + top_movies where uid =
89d3a708-0571-440a-8a9b-997856c17789 and name = 'Elaine'; // prepend to the list

Select * from users;

update users set top_movies = top_movies + ['the revenant'] where uid =
89d3a708-0571-440a-8a9b-997856c17789 and name = 'Elaine'; // append to the list

Let's replace position 1 (titanic) with blood diamonds:

update users set top_movies[1] = 'blood diamonds' where uid =
89d3a708-0571-440a-8a9b-997856c17789 and name = 'Elaine';

Select * from users;

Let's delete entry 1 from the list. // blood diamonds

delete top_movies[1] from users where uid =
89d3a708-0571-440a-8a9b-997856c17789 and name = 'Elaine';

select name, top_movies from users where uid =
89d3a708-0571-440a-8a9b-997856c17789 and name = 'Elaine'; // we select only the
collection

You can index a collection, let's query on a movie name:

create index top_movies on users(top_movies);

select * from users where top_movies contains 'gatsby';

MAPS

LET'S ADD A MAP COLUMN TO TABLE USERS:

ALTER TABLE users ADD movies map<timestamp, text>;

LET'S INSERT DATA:

update users set movies = {'2016-06-05':'rated movie','2016-06-06':'screened trailer'}
where uid = 89d3a708-0571-440a-8a9b-997856c17789 and name = 'Elaine';

Select movies from users;
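
Maps support element-level changes too, just like the set and list operations above. A short
sketch - the key and value here are made up for illustration:

update users set movies['2016-06-07'] = 'bought ticket' where uid =
89d3a708-0571-440a-8a9b-997856c17789 and name = 'Elaine'; // set/overwrite one entry by key

delete movies['2016-06-06'] from users where uid =
89d3a708-0571-440a-8a9b-997856c17789 and name = 'Elaine'; // remove one entry by key

Select movies from users;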

You can use a collection as a primary key - see exercise 6.1 below.

STOP here for theory update.

5. Copy data from/to a CSV file

Let's start working on some real data.

5.1. Copy from a CSV file

Drop table users;

Let's re-create table users with the syntax below and import data from a CSV file.

CREATE TABLE users(
    uid int,
    age int,
    gender text,
    occupation int,
    zipcode text,
    PRIMARY KEY (uid));
WE HAVE IN THE DATA DIRECTORY (CLOUD OR LOCAL INSTALLATIONS) THE CSV
FILES THAT WE WILL BE WORKING WITH. WE NEED Users.csv

copy users(uid,gender,age,occupation,zipcode) from '/data/users.csv' with
header = true;

select * from users limit 10;

5.2. Copy to a CSV file

COPY users(uid,gender,age) TO '/data/yourname.csv' WITH HEADER=TRUE;

6.1 Collections as Primary key

Did you know we can have a collection as part of the Primary Key? So if we want to
know movies by genres - we could define genres as part of the primary key.

CREATE TABLE movies_by_genres (
    movieid int,
    title text,
    genres frozen<set<text>>,
    year int,
    PRIMARY KEY (genres, movieid));

copy movies_by_genres(movieid,title,genres,year) from '/data/movies.csv' with
header = true;

Select * from movies_by_genres limit 10;
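
Since genres is the partition key, we can now look up movies by an exact genre set. A small
sketch, assuming the import kept the genre sets exactly as in the Movies.csv extract above
(note the doubled quote escaping the apostrophe):

select * from movies_by_genres where genres = {'Animation','Children''s','Comedy'};
// should return Toy Story (1995) and any other movie with exactly this genre set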

6.2. Let's also do an exercise on User Defined Types

CREATE TYPE address (
    street text,
    number int,
    flat text);

CREATE TABLE users_udt (
    uid int,
    address frozen<address>,
    PRIMARY KEY (uid));
// When using the frozen keyword, you cannot update parts of a
user-defined type value. The entire value must be overwritten. Cassandra
treats the value of a frozen, user-defined type like a blob.

INSERT INTO users_udt(uid,address) VALUES (1, {street : 'Pipera', number :
34, flat : '34C'});

Select * from users_udt;

// Let's see if we can select only parts of the UDT
select address.street from users_udt where uid = 1;

// Let's see if we can update parts of the UDT

update users_udt set address.street = 'Vitan' where uid = 1; // will not
work, the UDT is frozen

Important: From 3.4.2: User-defined types may now be stored in a non-frozen
form, allowing individual fields to be updated and deleted in UPDATE
statements and DELETE statements, respectively (CASSANDRA-7423). Let's
test.

CREATE TABLE users_udt_1 (
    uid int,
    address address,
    PRIMARY KEY (uid));

INSERT INTO users_udt_1(uid,address) VALUES (1, {street : 'Pipera', number :
34, flat : '34C'});

Select * from users_udt_1;

update users_udt_1 set address.street = 'Vitan' where uid = 1;

Select * from users_udt_1;
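
The CASSANDRA-7423 note above also mentions deleting individual fields of a non-frozen UDT;
a quick sketch on the same row:

delete address.number from users_udt_1 where uid = 1; // only the number field is nulled out

Select * from users_udt_1;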


7. Let's do some data deletes:

SELECT * from users limit 10;

Delete the value of a cell:

select * from users where uid = 4317;

delete age from users where uid = 4317;

select * from users where uid = 4317;

Delete a partition:

delete from users where uid = 4317;

Before we move on

// see table statistics

nodetool cfstats metro_systems.users

// see number of keys = number of partitions

nodetool cfhistograms metro_systems users
or
nodetool tablehistograms metro_systems users

Select token(uid) from users where uid = 1; // we want to see which token is
allocated for this partition

Basic data model -- update theory

Exercise 8: Time series: weather stations

CREATE TABLE temperature (
    w_id text,
    event_time timestamp,
    temperature int,
    PRIMARY KEY (w_id, event_time)
) with clustering order by (event_time DESC);

Let's import data in the temperature table:

copy temperature(w_id,event_time,temperature) from '/data/weatherdata.csv' with
header = true and maxrows = 100000;

Select * from temperature limit 30;
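
A side note on the clustering order we chose: since rows within a partition are stored
newest-first (event_time DESC), the latest reading for a station is simply the first row.
A minimal sketch:

select * from temperature where w_id = 'PNOSCPQCNX' limit 1; // most recent reading for this station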

Let's look for the data (time & temp) on a single weather station - how would the
query look?

select event_time, temperature from temperature where w_id = 'PNOSCPQCNX';

What about all the temperatures between 2016-01-10 03:49:46 and 2016-01-11
16:29:59?

select temperature from temperature where w_id = 'PNOSCPQCNX' AND
event_time > '2016-01-10 03:49:46' AND event_time < '2016-01-11 16:29:59';

And the average temperature between 2016-01-10 03:49:46 and 2016-01-11 16:29:59?

select avg(temperature) from temperature where w_id = 'PNOSCPQCNX' AND
event_time > '2016-01-10 03:49:46' AND event_time < '2016-01-11 16:29:59';

Create table temperature_by_day (a solution for the wide rows problem)

CREATE TABLE temperature_by_day (
    w_id text,
    date date,
    event_time timestamp,
    temperature int,
    PRIMARY KEY ((w_id, date), event_time)
);

Let's import data in the new table:

copy temperature_by_day(w_id,event_time,date,temperature) from
'/data/weatherdata-ext.csv' with header = true and maxrows = 100000;

Query the weather data for a single day and then for 2 consecutive days:

SELECT * FROM temperature_by_day WHERE w_id='FFOERIRWOL' AND date='2015-03-19';
SELECT * FROM temperature_by_day WHERE w_id='FFOERIRWOL' AND date in
('2015-03-19','2015-03-18');

Let's see how to query using a non-primary-key column.

This query will not work - because we are trying to query without the partition key:

select * from temperature_by_day where temperature = 29;

In order for it to work, either we allow filtering of the data:

select * from temperature_by_day where temperature = 29 allow filtering; // use
this with caution, it brings all the table data locally on one coordinator.

Or we create an index on temperature:

create index on temperature_by_day(temperature);

select * from temperature_by_day where temperature = 29;

Deleting old entries - WE CAN SET EXPIRING DATA AT TABLE LEVEL OR
PER ENTRY

EXPIRE DATA PER TABLE

CREATE TABLE latest_temperatures (
    w_id text,
    event_time timestamp,
    temperature int,
    PRIMARY KEY (w_id, event_time)
) WITH CLUSTERING ORDER BY (event_time DESC) AND default_time_to_live = 10;

copy latest_temperatures(w_id,event_time,temperature) from '/data/weatherdata.csv'
with header = true and maxrows = 10000;

Select * from latest_temperatures;

Repeat the select after 10 seconds - the rows will have expired.

EXPIRE DATA PER ENTRY

INSERT INTO latest_temperatures(w_id,event_time,temperature) VALUES
('12345', '2015-09-17 21:29:35', 20) USING TTL 20;

Select * from latest_temperatures; // repeat in 20 s
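
To watch the countdown, you can read a cell's remaining lifetime with the TTL() function;
a small sketch, run right after the insert above:

SELECT w_id, event_time, temperature, TTL(temperature) FROM latest_temperatures;
// TTL(temperature) returns the number of seconds left before the cell expires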
