The document provides key points about three technologies: Cassandra, Hive, and Spark. It highlights Cassandra's wide-column store model, schema enforcement, and primary key structure, along with Hive's database organization and data storage format. Additionally, it outlines Spark's interactive shell usage and functions for data manipulation.
Lab Exam Notes
Cassandra
Important Points:
1. It uses the wide-column store data model.
2. Unlike document-based NoSQL databases (like MongoDB), Cassandra enforces a schema: tables and their column types must be declared before data is written.
3. A keyspace in Cassandra is similar to a database (a collection of tables) in the SQL world.
4. In replication, class SimpleStrategy is meant for single-datacenter development or test clusters (use NetworkTopologyStrategy for production).
5. We can only query on the PRIMARY KEY components or on indexed columns. To filter on any other column, the query must end with the keyword ALLOW FILTERING.
6. PRIMARY KEY in Cassandra has 2 parts:
a. Partitioning Key: compulsory, the first part of the PRIMARY KEY definition, used to divide the rows among nodes. All rows with the same Partitioning Key are stored in the same partition, and therefore on the same node (and its replicas).
b. Clustering Key: optional, the second part of the PRIMARY KEY definition, used to sort the rows within a partition.
7. A secondary index allows querying on non-primary-key columns. We can create an index using CREATE INDEX ON <table_name> (<column_name>);
8. The use of ALLOW FILTERING is discouraged because Cassandra may have to read every row in the table and discard the ones that do not match, which is very slow on large tables.
9. Aggregate functions: COUNT(), SUM(), AVG(), MIN() and MAX() are built in from Cassandra 2.2 onwards; older versions support only COUNT(). Aggregating over many partitions is expensive, so a common alternative for heavy aggregation is to use Spark with Cassandra.
10. COUNT(*) counts all rows that match the query (including rows where a column is NULL). From Cassandra 2.2 onwards, COUNT(column_name) is also accepted and counts only the rows where that column is not NULL; older versions do not support it.
Hive
Important Points:
1. A database in Hive is a namespace that groups related tables.
2. When creating a table in Hive, the following clauses define how data is stored and structured:
ROW FORMAT DELIMITED -> Tells Hive that each row is stored as delimited text, with specific delimiter characters separating the fields (columns).
FIELDS TERMINATED BY ',' -> Indicates that fields are separated by a comma, which suits CSV (Comma-Separated Values) files.
STORED AS TEXTFILE -> Tells Hive that the data is stored as plain text files in HDFS.
3. To insert row(s) in Hive, we can use INSERT INTO TABLE <table_name> VALUES (<value_list_row1>), (<value_list_row2>), …, (<value_list_rown>);
Spark
Important Points:
1. Use spark-shell to launch an interactive shell using Scala.
2. In the shell, sc gives access to the SparkContext.
3. flatMap() will flatten the results while map() will keep them nested.
4. If x is a tuple with, say, 2 elements, we can use x._1 and x._2 to access the first and second elements respectively.
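
The difference between map() and flatMap(), and the tuple accessors, can be tried without a cluster. The sketch below uses plain Scala collections, whose map and flatMap behave analogously to the RDD methods of the same name. The object name MapVsFlatMap and the sample lines are made up for illustration. In spark-shell, one would typically start from sc.parallelize(lines) and call collect() to bring the results back.

```scala
// A minimal sketch with plain Scala collections (no Spark needed).
// MapVsFlatMap and the sample data are hypothetical, for illustration only.
object MapVsFlatMap {
  // Sample input; in spark-shell this could be sc.parallelize(lines).
  val lines: Seq[String] = Seq("hello world", "hello spark")

  // map() yields exactly one output per input, so each line becomes
  // its own list of words and the result stays nested.
  val nested: Seq[List[String]] = lines.map(_.split(" ").toList)

  // flatMap() concatenates the per-line word lists into one flat sequence.
  val flat: Seq[String] = lines.flatMap(_.split(" "))

  // Pair every word with 1, producing 2-element tuples.
  val pairs: Seq[(String, Int)] = flat.map(word => (word, 1))

  // _1 reads the first tuple element (the word), _2 the second (the 1).
  val counts: Map[String, Int] =
    pairs.groupBy(_._1).map { case (word, group) => (word, group.map(_._2).sum) }

  def main(args: Array[String]): Unit = {
    println(nested)
    println(flat)
    println(counts)
  }
}
```

Here nested holds two inner lists (one per line), while flat holds the four words in a single sequence; counts sums the second tuple element for every distinct first element.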