8-Wide Column Database and Document Database-25!01!2025
• The SQL NTILE() function divides the ordered rows of each partition into the number of buckets specified
by its expression and assigns a bucket number to each row.
• The buckets are numbered from 1 through the value of the expression, which must evaluate to a
positive integer for each partition.
• For example, the following query allocates the rows of each department to three buckets (HR has only two
rows, so only buckets 1 and 2 are used there).
• SELECT ENAME, EID, DEPTID, DEPTNAME, SALARY,
  NTILE(3) OVER (PARTITION BY DEPTNAME ORDER BY SALARY) AS BUCKETS
  FROM workers;
ENAME  EID  DEPTID  DEPTNAME  SALARY  BUCKETS
Niya   38   308     HR        45,000  1
Bobby  17   308     HR        58,000  2
Reyon  16   305     Testing   30,000  1
Jerry  15   305     Testing   35,000  2
Alice  18   305     Testing   45,000  3
John   11   301     Workshop  30,000  1
Tom    24   301     Workshop  50,000  2
Bob    22   301     Workshop  51,000  3
NTILE ()
• If the PARTITION BY clause is omitted from the above query, the whole result set is treated as a single
partition and all rows are distributed across three buckets:
• SELECT ENAME, EID, DEPTID, DEPTNAME, SALARY,
  NTILE(3) OVER (ORDER BY SALARY) AS BUCKETS
  FROM workers;
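The two NTILE() queries can be tried with a minimal sketch like the one below, which loads the workers rows shown earlier into an in-memory SQLite database through Python's sqlite3 module. The use of SQLite (3.25 or newer is needed for window functions), the column types, and the extra EID tie-breaker in the unpartitioned query are assumptions made for illustration, not part of the original slides.

```python
import sqlite3
from collections import Counter

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE workers (ENAME TEXT, EID INT, DEPTID INT, DEPTNAME TEXT, SALARY INT)"
)
conn.executemany(
    "INSERT INTO workers VALUES (?, ?, ?, ?, ?)",
    [
        ("Niya", 38, 308, "HR", 45000),
        ("Bobby", 17, 308, "HR", 58000),
        ("Reyon", 16, 305, "Testing", 30000),
        ("Jerry", 15, 305, "Testing", 35000),
        ("Alice", 18, 305, "Testing", 45000),
        ("John", 11, 301, "Workshop", 30000),
        ("Tom", 24, 301, "Workshop", 50000),
        ("Bob", 22, 301, "Workshop", 51000),
    ],
)

# Buckets are assigned independently inside each department.
by_dept = dict(conn.execute(
    "SELECT ENAME, NTILE(3) OVER (PARTITION BY DEPTNAME ORDER BY SALARY) FROM workers"
).fetchall())

# Without PARTITION BY all eight rows form one partition. EID is added as a
# tie-breaker (an assumption) so rows with equal salaries are ordered deterministically.
overall = dict(conn.execute(
    "SELECT ENAME, NTILE(3) OVER (ORDER BY SALARY, EID) FROM workers"
).fetchall())

print(by_dept)
print(overall)
print(Counter(overall.values()))  # number of rows per bucket
```

When the row count is not divisible by the number of buckets, the larger buckets come first, so eight rows split into buckets of 3, 3, and 2 rows.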
CUME_DIST ()
• The SQL window function CUME_DIST() returns the cumulative distribution of a value within a partition
of values.
• The cumulative distribution of a value is the number of rows with values less than or equal to (<=)
the current row’s value, divided by the total number of rows:
• CUME_DIST = N / total_rows
• where N is the number of rows with a value less than or equal to the current row's value and total_rows is the
number of rows in the partition or result set. The function returns a value greater than 0 and at most 1.
• SELECT ENAME, EID, DEPTID, DEPTNAME, SALARY,
  CUME_DIST() OVER (PARTITION BY DEPTNAME ORDER BY SALARY) AS CUME_DIST_VALUE
  FROM workers;
ENAME  EID  DEPTID  DEPTNAME  SALARY  CUME_DIST_VALUE
Niya   38   308     HR        45,000  0.5
Bobby  17   308     HR        58,000  1
Reyon  16   305     Testing   30,000  0.3333333333333333
Jerry  15   305     Testing   35,000  0.6666666666666666
Alice  18   305     Testing   45,000  1
John   11   301     Workshop  30,000  0.3333333333333333
Tom    24   301     Workshop  50,000  0.6666666666666666
Bob    22   301     Workshop  51,000  1
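The N / total rows rule can be checked by hand for one partition. The sketch below recomputes the values for the Testing department in plain Python; the helper name cume_dist and the dictionary layout are illustrative assumptions, not a database API.

```python
# Salaries of the Testing department, taken from the result table above.
salaries = {"Reyon": 30000, "Jerry": 35000, "Alice": 45000}
total_rows = len(salaries)


def cume_dist(value):
    """Return N / total_rows, where N counts the salaries <= value."""
    n = sum(1 for s in salaries.values() if s <= value)
    return n / total_rows


result = {name: cume_dist(salary) for name, salary in salaries.items()}
print(result)
```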
ROW_NUMBER ()
• The SQL window function ROW_NUMBER() assigns a sequential row number, starting at 1, to each
row within a specified partition.
• SELECT ROW_NUMBER() OVER (PARTITION BY DEPTNAME ORDER BY SALARY DESC) AS ROW_NUM,
  DEPTNAME, DEPTID, SALARY, ENAME, EID
  FROM workers;
AVG()
• A window function applies a function across a set of table rows that are related to the current row.
• A window function does not collapse rows into a single output row; the rows keep their separate
identities. It can access more than just the current row of the query result.
• To calculate the average value of each partition, we can use the window function AVG(). For example, the average
salary in each department is given by the following query:
• SELECT AVG(SALARY) OVER (PARTITION BY DEPTNAME) AS AVG_SALARY, DEPTNAME, DEPTID, SALARY, ENAME, EID
  FROM workers;
AVG_SALARY   DEPTNAME  DEPTID  SALARY  ENAME  EID
51,500.0000  HR        308     45,000  Niya   38
51,500.0000  HR        308     58,000  Bobby  17
36,666.6667  Testing   305     35,000  Jerry  15
36,666.6667  Testing   305     45,000  Alice  18
36,666.6667  Testing   305     30,000  Reyon  16
43,666.6667  Workshop  301     30,000  John   11
43,666.6667  Workshop  301     50,000  Tom    24
43,666.6667  Workshop  301     51,000  Bob    22
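• A moving (running) aggregate can be calculated by adding an ORDER BY clause along with PARTITION BY in
the window definition of AVG(). With ORDER BY and no explicit frame, the average covers the rows from the
start of the partition up to the current row.
• SELECT AVG(SALARY) OVER (PARTITION BY DEPTNAME ORDER BY SALARY DESC) AS AVG_SALARY,
  DEPTNAME, DEPTID, SALARY, ENAME, EID
  FROM workers;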
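To see how the ORDER BY clause turns the per-department average into a running average, the query can be run on the workers data. This is a small sketch using Python's sqlite3 module with an in-memory database; SQLite and the reduced three-column table are assumptions chosen to keep the example short.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE workers (ENAME TEXT, DEPTNAME TEXT, SALARY INT)")
conn.executemany(
    "INSERT INTO workers VALUES (?, ?, ?)",
    [
        ("Niya", "HR", 45000), ("Bobby", "HR", 58000),
        ("Reyon", "Testing", 30000), ("Jerry", "Testing", 35000),
        ("Alice", "Testing", 45000),
        ("John", "Workshop", 30000), ("Tom", "Workshop", 50000),
        ("Bob", "Workshop", 51000),
    ],
)

# Each row gets the average of its own salary and all higher salaries
# in the same department (rows are ordered by SALARY DESC).
running = dict(conn.execute(
    "SELECT ENAME, AVG(SALARY) OVER (PARTITION BY DEPTNAME ORDER BY SALARY DESC) "
    "FROM workers"
).fetchall())

for name in ("Alice", "Jerry", "Reyon"):
    print(name, round(running[name], 4))
```

In the Testing department the values are 45,000 for Alice, 40,000 for Jerry (the mean of 45,000 and 35,000), and about 36,666.67 for Reyon, which equals the plain department average.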
SUM()
• The SUM() window function returns the sum of the input column or expression across the rows of
each partition.
• For example, to calculate the sum of the salaries of the workers in each department, we can write the query as
follows:
• SELECT SUM(SALARY) OVER (PARTITION BY DEPTNAME) AS SUM_SALARY,
  DEPTNAME, DEPTID, SALARY, ENAME, EID
  FROM workers;
COUNT()
• The COUNT() window function counts the rows in each partition for which the expression is not
NULL. To count the employees in each department, we can write the query as follows:
• SELECT COUNT(ENAME) OVER (PARTITION BY DEPTNAME) AS COUNT_ENAME,
  DEPTNAME, DEPTID, SALARY, ENAME, EID
  FROM workers;
MIN() and MAX()
• The aggregate window functions MIN() and MAX() return the minimum and maximum values
of an expression within a specified window.
• The following query will return the maximum and minimum salaries of workers in each
department.
• SELECT DEPTNAME, DEPTID, SALARY, ENAME, EID,
  MAX(SALARY) OVER (PARTITION BY DEPTNAME) AS MAX_SAL,
  MIN(SALARY) OVER (PARTITION BY DEPTNAME) AS MIN_SAL
  FROM workers;
DEPTNAME  DEPTID  SALARY  ENAME  EID  MAX_SAL  MIN_SAL
HR        308     45,000  Niya   38   58,000   45,000
HR        308     58,000  Bobby  17   58,000   45,000
Testing   305     35,000  Jerry  15   45,000   30,000
Testing   305     45,000  Alice  18   45,000   30,000
Testing   305     30,000  Reyon  16   45,000   30,000
Workshop  301     30,000  John   11   51,000   30,000
Workshop  301     50,000  Tom    24   51,000   30,000
Workshop  301     51,000  Bob    22   51,000   30,000
LEAD()
• The SQL LEAD() function provides access to a row at a specified physical offset after the
current row.
• For example, using LEAD() you can read, from the current row, the data of the next row, of the
second row after the current row, of the third row after the current row, and so on.
• The LEAD() function syntax is given below:
• LEAD(return_value [,offset[, default ]])
OVER (
PARTITION BY ...
ORDER BY ...
)
• In the above syntax, return_value is the expression evaluated on the row that lies offset rows after the
current row. offset is the number of rows forward from the current row from which to access
data.
• The offset must be a nonnegative integer. If the offset is not specified, it defaults to 1.
• When the offset goes beyond the end of the partition, the function returns the default value. If default is
not specified, NULL is returned.
• LEAD() is applied separately to each partition created by the PARTITION BY clause. If the PARTITION
BY clause is not used, the whole result set is treated as a single partition.
• The ORDER BY clause sorts the rows within each partition, and LEAD() follows that order. The following
query returns the salary of the next person in the department; the last person in each department has no
next row, so NULL is returned there.
• SELECT ENAME, EID, DEPTID, DEPTNAME, SALARY,
  LEAD(SALARY) OVER (PARTITION BY DEPTNAME ORDER BY SALARY) AS NEXT_PERSON_SALARY
  FROM workers;
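The effect of the optional offset and default arguments can be seen in a short sketch run against the workers data. It uses Python's sqlite3 module with an in-memory database; SQLite, the reduced three-column table, and the chosen offset 2 with default 0 are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE workers (ENAME TEXT, DEPTNAME TEXT, SALARY INT)")
conn.executemany(
    "INSERT INTO workers VALUES (?, ?, ?)",
    [
        ("Niya", "HR", 45000), ("Bobby", "HR", 58000),
        ("Reyon", "Testing", 30000), ("Jerry", "Testing", 35000),
        ("Alice", "Testing", 45000),
        ("John", "Workshop", 30000), ("Tom", "Workshop", 50000),
        ("Bob", "Workshop", 51000),
    ],
)

rows = conn.execute(
    """
    SELECT ENAME,
           -- offset defaults to 1, default value defaults to NULL
           LEAD(SALARY) OVER (PARTITION BY DEPTNAME ORDER BY SALARY),
           -- two rows ahead; 0 is returned when that row does not exist
           LEAD(SALARY, 2, 0) OVER (PARTITION BY DEPTNAME ORDER BY SALARY)
    FROM workers
    """
).fetchall()

# name -> (salary of next person, salary two rows ahead or 0)
lead_result = {name: (nxt, second) for name, nxt, second in rows}
for name in ("John", "Tom", "Bob"):
    print(name, lead_result[name])
```

In the Workshop department John sees 50,000 and 51,000, Tom sees 51,000 and the default 0, and Bob, the last row, gets NULL (None in Python) and 0.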
FIRST_VALUE()
• The SQL window function FIRST_VALUE() returns the first value in an ordered partition of a result set or
window frame.
• The following query returns the first salary value in each department when ordered by salary; because the
order is descending, this is the highest salary of the department.
• SELECT FIRST_VALUE(SALARY) OVER (PARTITION BY DEPTNAME ORDER BY SALARY DESC) AS FIRST_ROW,
  DEPTNAME, DEPTID, SALARY, ENAME, EID
  FROM workers;
LAST_VALUE()
• The SQL window function LAST_VALUE() returns the last value in an ordered partition of a result set or
window frame.
• The following query applies LAST_VALUE() to the salaries of each department, ordered by descending salary.
• With ORDER BY and no explicit frame, the default frame ends at the current row, so the query as written
returns the value of the current row (or of its last peer when salaries tie). To get the last, that is the lowest,
salary of the whole department, add ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING after
the ORDER BY clause.
• SELECT LAST_VALUE(SALARY) OVER (PARTITION BY DEPTNAME ORDER BY SALARY DESC) AS LAST_ROW,
  DEPTNAME, DEPTID, SALARY, ENAME, EID
  FROM workers;
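The window frame matters for LAST_VALUE(). The sketch below compares the query as written with a variant that spans the whole partition, using Python's sqlite3 module and an in-memory database. SQLite, the reduced table, and the explicit ROWS BETWEEN clause are assumptions added for illustration; the frame clause is standard SQL but is not part of the original query.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE workers (ENAME TEXT, DEPTNAME TEXT, SALARY INT)")
conn.executemany(
    "INSERT INTO workers VALUES (?, ?, ?)",
    [
        ("Niya", "HR", 45000), ("Bobby", "HR", 58000),
        ("Reyon", "Testing", 30000), ("Jerry", "Testing", 35000),
        ("Alice", "Testing", 45000),
        ("John", "Workshop", 30000), ("Tom", "Workshop", 50000),
        ("Bob", "Workshop", 51000),
    ],
)

rows = conn.execute(
    """
    SELECT ENAME,
           -- default frame: from the start of the partition to the current row
           LAST_VALUE(SALARY) OVER (PARTITION BY DEPTNAME ORDER BY SALARY DESC),
           -- explicit frame covering the whole partition
           LAST_VALUE(SALARY) OVER (
               PARTITION BY DEPTNAME ORDER BY SALARY DESC
               ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)
    FROM workers
    """
).fetchall()

# name -> (value with default frame, value with full-partition frame)
last_values = {name: (default_frame, full_frame) for name, default_frame, full_frame in rows}
for name in ("Alice", "Jerry", "Reyon"):
    print(name, last_values[name])
```

With the default frame each Testing row simply sees its own salary, while the full-partition frame returns 30,000, the last salary in descending order, for every row of the department.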
LAG()
• The SQL window function LAG() accesses data from a previous row at a defined
offset. It works like LEAD(), but in the opposite direction.
• LEAD() accesses the values of subsequent rows, whereas LAG() accesses the values of
previous rows.
• It is useful for comparing the current row's value with the previous row's value.
• SELECT ENAME, EID, DEPTID, DEPTNAME, SALARY,
  LAG(SALARY) OVER (PARTITION BY DEPTNAME ORDER BY SALARY) AS PREVIOUS_PERSON_SALARY
  FROM workers;
• The above query finds the salary of the previous person in each department, with the rows
sorted by salary.
• As no previous row is available for the first row in each department, it returns NULL
there.
ENAME  EID  DEPTID  DEPTNAME  SALARY  PREVIOUS_PERSON_SALARY
Niya   38   308     HR        45,000  NULL
Bobby  17   308     HR        58,000  45,000
Reyon  16   305     Testing   30,000  NULL
Jerry  15   305     Testing   35,000  30,000
Alice  18   305     Testing   45,000  35,000
John   11   301     Workshop  30,000  NULL
Tom    24   301     Workshop  50,000  30,000
Bob    22   301     Workshop  51,000  50,000
Preparing Data for Analytics Tool
• One of the first steps in a data science project is cleaning the
dataset you are working with.
• Various SQL queries can be used to clean, update, and filter data by eliminating
redundant and unwanted records. This can be done with SQL constructs
such as CASE WHEN, COALESCE, NULLIF, LEAST/GREATEST, casting, and
DISTINCT.
Sales Table
sale_no  product_id  quantity  price    customer_name
5,001    3           4         21,000   John
5,002    11          NULL      17,000   Anna
5,003    94          10        105,000  Tom
5,004    86          8         27,000   Nora
5,005    88          18        8,000    Tom
CASE WHEN
• The CASE expression goes through the conditions specified in its WHEN clauses and returns a
value as soon as the first condition is met.
• It works like a nested IF-THEN-ELSE statement. Once a condition is true, it returns the value
specified after THEN. If no condition is true, the value in the ELSE clause is returned.
• It returns NULL when no condition is true and no ELSE part is specified.
• Suppose we fetch all data from the above sales table and want to add an extra column, summary,
that categorizes sales into More, Less, and Avg. This can be done with a CASE expression as follows
(a NULL quantity satisfies neither condition, so it falls through to ELSE):
• SELECT *,
  CASE
  WHEN quantity >= 10 THEN 'More'
  WHEN quantity >= 6 THEN 'Avg'
  ELSE 'Less'
  END AS summary
  FROM sales;
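The CASE expression above can be run against the sales rows with a minimal sketch like this one. It uses Python's sqlite3 module and an in-memory database; SQLite and the column types are assumptions, and sale numbers and prices are stored without thousands separators.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales (sale_no INT, product_id INT, quantity INT, "
    "price INT, customer_name TEXT)"
)
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?, ?, ?)",
    [
        (5001, 3, 4, 21000, "John"),
        (5002, 11, None, 17000, "Anna"),  # None is stored as NULL
        (5003, 94, 10, 105000, "Tom"),
        (5004, 86, 8, 27000, "Nora"),
        (5005, 88, 18, 8000, "Tom"),
    ],
)

# sale_no -> summary label; conditions are tested top to bottom.
summary = dict(conn.execute(
    """
    SELECT sale_no,
           CASE
               WHEN quantity >= 10 THEN 'More'
               WHEN quantity >= 6 THEN 'Avg'
               ELSE 'Less'
           END AS summary
    FROM sales
    """
).fetchall())
print(summary)
```

Sale 5,002 has a NULL quantity, so both comparisons are unknown and the ELSE branch labels it 'Less'.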
COALESCE
• Some records in a database may contain NULL values, but when computing
statistics on such data you may need to replace these NULL values with
some other value. This can be done with the COALESCE function.
• COALESCE returns its first argument that is not NULL. In the common two-argument form, the
first argument is a column that may contain NULL, and the second is the value that replaces NULL.
• Every NULL in the column is thus replaced by the default value given as the second
argument.
• The following example replaces NULL by -1 in the quantity column.
• SELECT
  customer_name, product_id,
  COALESCE(quantity, -1) AS quantity
  FROM sales;
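Running the COALESCE query on the sales rows shows the replacement. The sketch uses Python's sqlite3 module with an in-memory database; SQLite, the column types, and the added ORDER BY sale_no (used only to make the output order predictable) are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales (sale_no INT, product_id INT, quantity INT, "
    "price INT, customer_name TEXT)"
)
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?, ?, ?)",
    [
        (5001, 3, 4, 21000, "John"),
        (5002, 11, None, 17000, "Anna"),  # NULL quantity
        (5003, 94, 10, 105000, "Tom"),
        (5004, 86, 8, 27000, "Nora"),
        (5005, 88, 18, 8000, "Tom"),
    ],
)

cleaned = conn.execute(
    "SELECT customer_name, product_id, COALESCE(quantity, -1) AS quantity "
    "FROM sales ORDER BY sale_no"
).fetchall()
for row in cleaned:
    print(row)
```

Only Anna's row changes: its NULL quantity becomes -1, while all other quantities are returned unchanged.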
NULLIF
• The NULLIF function takes two parameters and returns NULL if the first parameter
equals the second; otherwise it returns the first parameter.
• As an example, imagine that we want to replace the product_id value 11 by NULL.
This could be done with the following query:
• SELECT sale_no, customer_name,
  NULLIF(product_id, 11) AS product_id
  FROM sales;
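A quick way to check the NULLIF query is to run it on the sales rows. The sketch below uses Python's sqlite3 module with an in-memory database; SQLite, the reduced three-column table, and the ORDER BY sale_no are assumptions made to keep the example short and its output order predictable.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (sale_no INT, product_id INT, customer_name TEXT)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [
        (5001, 3, "John"),
        (5002, 11, "Anna"),
        (5003, 94, "Tom"),
        (5004, 86, "Nora"),
        (5005, 88, "Tom"),
    ],
)

nullified = conn.execute(
    "SELECT sale_no, customer_name, NULLIF(product_id, 11) AS product_id "
    "FROM sales ORDER BY sale_no"
).fetchall()
for row in nullified:
    print(row)  # NULL is shown as None in Python
```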
LEAST/GREATEST
• LEAST and GREATEST are frequently used functions for data cleaning.
• These functions return the least and greatest values, respectively, from a given
list of values. They are useful for replacing a value that is too high or
too low.
• For example, suppose the minimum price in the above table needs to be 10,000. This can be
done with the following query.
• SELECT
  sale_no, product_id, quantity,
  GREATEST(10000, price) AS price
  FROM sales;
• The price 8,000 in the last row is replaced by 10,000: since 8,000 is less than
10,000, GREATEST returns the larger of the two values.
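The price floor can be illustrated with a small runnable sketch. SQLite, used here through Python's sqlite3 module, has no GREATEST or LEAST functions, so the sketch substitutes SQLite's multi-argument scalar MAX() and MIN(), which play the same role for non-NULL values; this substitution, the reduced two-column table, and the upper cap of 100,000 in the second column are assumptions for illustration only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (sale_no INT, price INT)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [(5001, 21000), (5002, 17000), (5003, 105000), (5004, 27000), (5005, 8000)],
)

rows = conn.execute(
    """
    SELECT sale_no,
           MAX(10000, price),   -- stands in for GREATEST(10000, price)
           MIN(100000, price)   -- stands in for LEAST(100000, price); cap is hypothetical
    FROM sales
    ORDER BY sale_no
    """
).fetchall()

floored = [floor_price for _, floor_price, _ in rows]
capped = [cap_price for _, _, cap_price in rows]
print(floored)
print(capped)
```

The floor raises only the 8,000 price to 10,000, and the hypothetical cap lowers only the 105,000 price to 100,000.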
DISTINCT
• The DISTINCT keyword returns only the distinct values of the specified
column.
• DISTINCT can also be applied to multiple columns to get the distinct
combinations of the specified columns.
• For example, to extract all the unique names in the sales table, you could write
the following query:
• SELECT
  DISTINCT customer_name
  FROM sales;
• The above query gives the following result; the duplicate name Tom is removed from the
customer_name column.
customer_name
John
Anna
Tom
Nora
Advanced NoSQL for Data Science
• NoSQL, which means “not only SQL”, is an alternative to relational databases, in which data is
stored in tables with a fixed schema.
• NoSQL is a database design approach that can accommodate various data models, including key-value,
document, columnar, and graph formats.
• NoSQL databases are very useful for working with large distributed data.
• NoSQL databases were built in the early 2000s to deal with large-scale database clustering in
web and cloud applications.
• NoSQL has a flexible schema, unlike the traditional relational database model. Rows can have
different structures or attributes.
• NoSQL databases are very useful for big data tasks because they follow
the Basically Available, Soft state, Eventual consistency (BASE) approach instead of Atomicity,
Consistency, Isolation, and Durability, commonly known as the ACID properties.
• Two major drawbacks of SQL databases are rigidity when adding columns and attributes to tables, and
slow performance when many tables need to be joined or when tables store a large amount
of data.
• NoSQL databases try to overcome these two drawbacks of relational databases.
• NoSQL offers a more flexible, schema-free solution that can work with unstructured data.
Why NoSQL?
• NoSQL supports unstructured and semi-structured data.
• In many applications, an attribute needs to be added on the fly for
specific rows but not for every row, and it may be of a different type than the attributes in
other rows.
• Now let us explore some NoSQL features to understand why you should choose
NoSQL databases for data science.
• Features:
• It does not use the relational model to store data.
• It runs well on clusters.
• It is mostly open-source.
• It is capable of handling large amounts of social media data.
• It is schema-less.
Document Databases for Data Science
• Document-based NoSQL databases store data as JSON objects. Each
document is a structure of key-value pairs.
• Document-based NoSQL databases are simple for engineers to work with, as they map
items to JSON objects.
• JSON is a very common data format, widely adopted by web developers, and it
permits us to change the structure whenever required.
• Some examples of document-based NoSQL databases are CouchDB, MongoDB,
OrientDB, and BaseX.
JSON Document Format
{
  "_id": 1,
  "name": { "first": "John", "last": "Backus" },
  "contribs": [ "Fortran", "ALGOL", "Form", "FP" ],
  "awards": [
    {
      "award": "Dowell Award",
      "year": 1988,
      "by": "Computer Society"
    },
    {
      "award": "First Prize",
      "year": 1993,
      "by": "National Academy of Engineering"
    }
  ]
}
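A document like this one can be handled as a nested structure of dictionaries and lists. The following sketch parses it with Python's standard json module and reads a few nested fields; the "notes" attribute added at the end is a hypothetical field used only to show that one document can gain an attribute without any schema change.

```python
import json

doc = json.loads("""
{
  "_id": 1,
  "name": { "first": "John", "last": "Backus" },
  "contribs": [ "Fortran", "ALGOL", "Form", "FP" ],
  "awards": [
    { "award": "Dowell Award", "year": 1988, "by": "Computer Society" },
    { "award": "First Prize", "year": 1993, "by": "National Academy of Engineering" }
  ]
}
""")

# Nested objects become dicts, arrays become lists.
full_name = f'{doc["name"]["first"]} {doc["name"]["last"]}'
award_years = [award["year"] for award in doc["awards"]]

# Flexible schema: add an attribute to this document only (hypothetical field).
doc["notes"] = "added on the fly"

print(full_name)
print(award_years)
print(sorted(doc))
```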
Wide Column Databases for Data Science
• Like a relational database, a wide-column database stores data in records, but it
can also store a very large number of dynamic columns.
• It groups the dynamically added columns into column families.
• Instead of having multiple tables as in relational databases, we have multiple column families in
wide-column databases.
• Examples of wide-column databases are Cassandra and HBase.
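The idea of rows whose columns are grouped into column families can be pictured with nested dictionaries. This is only a conceptual sketch in plain Python, not the API of Cassandra or HBase; every row key, family name, column name, and value in it is hypothetical.

```python
# row key -> column family -> column -> value (all names are hypothetical)
table = {
    "user:1": {
        "profile": {"name": "John"},
        "activity": {"last_sale": 5001},
    },
    "user:2": {
        "profile": {"name": "Anna", "city": "Pune"},
    },
}

# A column can be added to one row without changing any other row.
table["user:1"]["profile"]["email"] = "john@example.com"

# Each row ends up with its own set of columns.
columns_per_row = {
    key: sorted(col for family in row.values() for col in family)
    for key, row in table.items()
}
print(columns_per_row)
```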
Graph Databases for Data Science
• A graph database stores data in the form of nodes and edges.
• Nodes store information about the main entities, such as people, places, and products, and
edges store the relationships between them.
• Graph databases are very useful for finding patterns or relationships in data, as in social
networks and recommendation engines.
• Examples of graph databases are Neo4j and Amazon Neptune.
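The node-and-edge model can be sketched with ordinary Python collections. This is a conceptual illustration, not the query language of Neo4j or Amazon Neptune; the node names, relationship labels, and the helper neighbours are hypothetical.

```python
# Nodes hold entity data; edges hold (source, relationship, target) triples.
nodes = {
    "john": {"type": "person"},
    "anna": {"type": "person"},
    "tom": {"type": "person"},
    "laptop": {"type": "product"},
}
edges = [
    ("john", "FRIEND_OF", "anna"),
    ("anna", "FRIEND_OF", "tom"),
    ("anna", "BOUGHT", "laptop"),
]


def neighbours(node, relation):
    """Return the targets reachable from node over edges with the given label."""
    return [dst for src, rel, dst in edges if src == node and rel == relation]


# A simple recommendation: products bought by John's friends.
recommended = [
    product
    for friend in neighbours("john", "FRIEND_OF")
    for product in neighbours(friend, "BOUGHT")
]
print(recommended)
```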