
Lab Exam Notes

The document provides key points about three technologies: Cassandra, Hive, and Spark. It highlights Cassandra's wide-column store model, schema enforcement, and primary key structure, along with Hive's database organization and data storage format. Additionally, it outlines Spark's interactive shell usage and functions for data manipulation.

Uploaded by

Sagar Talagatti

Cassandra

Important Points:

1. It uses Wide-Column Store data model.


2. Unlike document-based NoSQL databases (like MongoDB), Cassandra enforces a
schema.
3. A keyspace in Cassandra is similar to a database (a collection of tables) in the
SQL world.
4. In replication, class SimpleStrategy is meant for single-datacenter setups such
as development or test clusters (use NetworkTopologyStrategy for production).
5. In Cassandra, we can only query based on the Primary Key components or on
indexed columns. If we want to query based on any other column, then we must
add the keyword ALLOW FILTERING to the query.
6. PRIMARY KEY in Cassandra has 2 parts:
a. Partition Key: compulsory, the first part of the PRIMARY KEY definition. It is
hashed to decide which node stores a row, so all rows with the same
Partition Key are stored together in one partition on the same node (and
on its replicas).
b. Clustering Key: optional, the remaining part of the PRIMARY KEY definition.
It is used to sort the rows within a partition.
7. A secondary index allows querying on non-primary-key columns. We can create
an index using CREATE INDEX ON <table_name> (<column_name>);
8. When querying, the use of ALLOW FILTERING is discouraged because Cassandra
may have to read every row of the table and discard the ones that do not match,
which is very slow on large tables.
9. Since version 2.2, Cassandra has built-in aggregate functions: COUNT(), SUM(),
AVG(), MIN() and MAX(). They are efficient only when the query is restricted to
a single partition; for aggregation over a whole table, a common alternative is
to use Spark with Cassandra.
10. COUNT(*) counts all rows that match the query (including rows where a
column is NULL). COUNT(column_name) is also supported and counts only the
rows where that column is not NULL.
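Points 5, 6 and 8 can be pictured with a toy model. The sketch below is plain Scala, not Cassandra driver code, and it assumes a hypothetical table declared with PRIMARY KEY ((city), name), so city is the Partition Key and name is the Clustering Key. The User rows and all value names are invented for illustration.

```scala
// Toy model in plain Scala collections, not Cassandra itself.
// Assumed (hypothetical) table: PRIMARY KEY ((city), name).
case class User(city: String, name: String, age: Int)

val rows = List(
  User("Pune", "Meera", 31),
  User("Delhi", "Arjun", 27),
  User("Pune", "Asha", 24)
)

// Rows sharing a Partition Key are kept together in one partition,
// and inside a partition they are sorted by the Clustering Key.
val partitions: Map[String, List[User]] =
  rows.groupBy(_.city).map { case (city, users) => city -> users.sortBy(_.name) }

// Query on the Partition Key: a direct lookup of a single partition.
val inPune = partitions.getOrElse("Pune", Nil)
// inPune.map(_.name) == List("Asha", "Meera")

// Query on a non-key column without an index: every partition has to be
// scanned and filtered, which is the kind of work ALLOW FILTERING permits.
val under30 = partitions.values.flatten.filter(_.age < 30).toList
// The order across partitions is not defined, so compare the names as a set:
// under30.map(_.name).toSet == Set("Arjun", "Asha")
```

The lookup touches one partition, while the filter walks all of them; that difference is why queries on the Partition Key are cheap and ALLOW FILTERING is not.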
Hive
Important Points:
1. A database in Hive is a namespace that groups related tables.
2. When creating a table in Hive, the following clauses define how data is stored
and structured:
a. ROW FORMAT DELIMITED -> The table data is stored as text in which
delimiters separate the fields (columns).
b. FIELDS TERMINATED BY ',' -> Fields are separated by a comma, which suits
CSV (Comma-Separated Values) files.
c. STORED AS TEXTFILE -> The data is stored as plain text files in HDFS.
3. To insert row(s) in Hive, we can use the statement INSERT INTO TABLE <table_name>
VALUES (<value_list_row1>), (<value_list_row2>), …, (<value_list_rown>);
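To make the storage clauses in point 2 concrete, here is a rough sketch of how one line of a comma-delimited text file maps onto the columns of a table. It is plain Scala that mimics the splitting, not Hive itself; the employees table, its columns and the sample line are hypothetical.

```scala
// Hypothetical table: employees(id INT, name STRING, salary DOUBLE),
// declared with ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
// and STORED AS TEXTFILE.
case class Employee(id: Int, name: String, salary: Double)

// One line of the text file is one row; the comma separates the fields.
val line = "1,Asha,52000.5"
val fields = line.split(",")

// Each field is then read as the declared column type.
val employee = Employee(fields(0).toInt, fields(1), fields(2).toDouble)
// employee == Employee(1, "Asha", 52000.5)
```

In this sketch a value that itself contains a comma would be split into two fields, so such data needs a different delimiter.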
Spark

Important Points:

1. Use spark-shell to launch an interactive shell using Scala.


2. In the shell, sc gives access to the SparkContext.
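3. map() produces exactly one output element per input element, so if the function
returns a collection the results stay nested. flatMap() flattens those
collections into a single collection.
4. If x is a tuple with, say, 2 elements, we can use x._1 and x._2 to access the first
and second elements respectively.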
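Points 3 and 4 can be tried out with a small sketch. It uses plain Scala collections so that it runs in any Scala REPL as well as in spark-shell; map() and flatMap() are also available on RDDs, for example on sc.parallelize(lines). The sample lines and the value names are made up for illustration.

```scala
// Hypothetical sample data; in spark-shell this could instead be an RDD
// created with sc.parallelize(lines).
val lines = List("hello world", "hello spark")

// map(): one output element per input element, so the result stays nested.
val nested = lines.map(line => line.split(" ").toList)
// nested == List(List("hello", "world"), List("hello", "spark"))

// flatMap(): the per-line lists are flattened into a single list of words.
val words = lines.flatMap(line => line.split(" ").toList)
// words == List("hello", "world", "hello", "spark")

// Pair each word with a count of 1, then read the tuple fields back.
val pairs = words.map(word => (word, 1))
val firstWord = pairs.head._1   // "hello"
val firstCount = pairs.head._2  // 1
```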
