8-Wide Column Database and Document Database-25!01!2025

The document explains various SQL window functions including NTILE(), CUME_DIST(), ROW_NUMBER(), AVG(), SUM(), COUNT(), MIN(), MAX(), and LEAD(). Each function is described with its purpose, syntax, and examples using a dataset of workers, demonstrating how to partition and analyze data based on different criteria. The document provides SQL queries to illustrate how these functions can be applied to calculate distributions, averages, sums, and differences within specified partitions.

Uploaded by

Rahul

NTILE ()

• The SQL NTILE() function divides the ordered rows of each partition into the number of buckets specified by its argument and assigns a bucket number to each row.
• The buckets are numbered from 1 up to the argument value, which must evaluate to a positive integer for each partition. If the rows do not divide evenly, the earlier buckets each receive one extra row.
• For example, the following query allocates the rows of each department to three buckets.
• SELECT ENAME, EID, DEPTID, DEPTNAME, SALARY, NTILE(3) OVER (PARTITION BY DEPTNAME ORDER BY
SALARY) AS BUCKETS FROM workers;
ENAME EID DEPTID DEPTNAME SALARY BUCKETS
Niya 38 308 HR 45,000 1
Bobby 17 308 HR 58,000 2
Reyon 16 305 Testing 30,000 1
Jerry 15 305 Testing 35,000 2
Alice 18 305 Testing 45,000 3
John 11 301 Workshop 30,000 1
Tom 24 301 Workshop 50,000 2
Bob 22 301 Workshop 51,000 3
• If the PARTITION BY clause is omitted from the above query, the whole result set is treated as a single partition, and the results are as follows:
• SELECT ENAME, EID, DEPTID, DEPTNAME, SALARY, NTILE(3) OVER (ORDER BY SALARY)
AS BUCKETS FROM workers;

ENAME EID DEPTID DEPTNAME SALARY BUCKETS


John 11 301 Workshop 30,000 1
Reyon 16 305 Testing 30,000 1
Jerry 15 305 Testing 35,000 1
Niya 38 308 HR 45,000 2
Alice 18 305 Testing 45,000 2
Tom 24 301 Workshop 50,000 2
Bob 22 301 Workshop 51,000 3
Bobby 17 308 HR 58,000 3
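The two NTILE(3) queries above can be tried out with a minimal sketch that uses Python's built-in sqlite3 module, assuming the bundled SQLite is version 3.25 or newer (the first release with window functions). The in-memory workers table copies the rows shown above. The EID tie-breaker in the second ORDER BY is an addition, not part of the original query, so that rows with equal salaries are ordered deterministically.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE workers (ENAME TEXT, EID INTEGER, DEPTID INTEGER, "
    "DEPTNAME TEXT, SALARY INTEGER)"
)
conn.executemany(
    "INSERT INTO workers VALUES (?, ?, ?, ?, ?)",
    [
        ("Niya", 38, 308, "HR", 45000),
        ("Bobby", 17, 308, "HR", 58000),
        ("Reyon", 16, 305, "Testing", 30000),
        ("Jerry", 15, 305, "Testing", 35000),
        ("Alice", 18, 305, "Testing", 45000),
        ("John", 11, 301, "Workshop", 30000),
        ("Tom", 24, 301, "Workshop", 50000),
        ("Bob", 22, 301, "Workshop", 51000),
    ],
)

# Buckets restart inside every department.
partitioned = dict(conn.execute(
    "SELECT ENAME, NTILE(3) OVER (PARTITION BY DEPTNAME ORDER BY SALARY) "
    "FROM workers"
).fetchall())

# No PARTITION BY: all eight rows form one partition (bucket sizes 3, 3, 2).
overall = dict(conn.execute(
    "SELECT ENAME, NTILE(3) OVER (ORDER BY SALARY, EID) FROM workers"
).fetchall())

print(partitioned)
print(overall)
```

HR has only two rows, so NTILE(3) never assigns bucket 3 in that department.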

CUME_DIST ()
• The SQL window function CUME_DIST() returns the cumulative distribution of a value within a partition of values.
• The cumulative distribution of a value is the number of rows with values less than or equal to (<=) the current row's value, divided by the total number of rows:
• N / total_rows
• where N is the number of rows with a value less than or equal to the current row's value and total_rows is the number of rows in the partition or result set. The function returns a value greater than 0 and at most 1.
• SELECT ENAME, EID, DEPTID, DEPTNAME, SALARY, CUME_DIST() OVER (PARTITION BY DEPTNAME ORDER
BY SALARY) AS CUME_DIST_VALUE FROM workers;
ENAME EID DEPTID DEPTNAME SALARY CUME_DIST_VALUE
Niya 38 308 HR 45,000 0.5
Bobby 17 308 HR 58,000 1
Reyon 16 305 Testing 30,000 0.3333333333333333
Jerry 15 305 Testing 35,000 0.6666666666666666
Alice 18 305 Testing 45,000 1
John 11 301 Workshop 30,000 0.3333333333333333
Tom 24 301 Workshop 50,000 0.6666666666666666
Bob 22 301 Workshop 51,000 1
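As a rough check of the N / total rows formula, the sketch below runs the CUME_DIST() query through Python's built-in sqlite3 module (assuming SQLite 3.25 or newer) and recomputes each value by hand. The names cume_by_name and manual_by_name are illustrative only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE workers (ENAME TEXT, EID INTEGER, DEPTID INTEGER, "
    "DEPTNAME TEXT, SALARY INTEGER)"
)
conn.executemany(
    "INSERT INTO workers VALUES (?, ?, ?, ?, ?)",
    [
        ("Niya", 38, 308, "HR", 45000),
        ("Bobby", 17, 308, "HR", 58000),
        ("Reyon", 16, 305, "Testing", 30000),
        ("Jerry", 15, 305, "Testing", 35000),
        ("Alice", 18, 305, "Testing", 45000),
        ("John", 11, 301, "Workshop", 30000),
        ("Tom", 24, 301, "Workshop", 50000),
        ("Bob", 22, 301, "Workshop", 51000),
    ],
)

rows = conn.execute(
    "SELECT ENAME, DEPTNAME, SALARY, "
    "CUME_DIST() OVER (PARTITION BY DEPTNAME ORDER BY SALARY) FROM workers"
).fetchall()
cume_by_name = {ename: cume for ename, _, _, cume in rows}

# Recompute N / total_rows by hand inside each department.
manual_by_name = {}
for ename, dept, salary, _ in rows:
    peers = [s for _, d, s, _ in rows if d == dept]
    n = sum(1 for s in peers if s <= salary)  # rows with salary <= current
    manual_by_name[ename] = n / len(peers)

for name in cume_by_name:
    print(name, cume_by_name[name], manual_by_name[name])
```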
ROW_NUMBER ()
• The SQL window function ROW_NUMBER() assigns a sequential row number, starting at 1, to each row within a specified partition.
• SELECT ROW_NUMBER() OVER (PARTITION BY DEPTNAME ORDER BY SALARY DESC) AS
ROW_NUM, DEPTNAME, DEPTID, SALARY, ENAME, EID FROM workers;

ROW_NUM DEPTNAME DEPTID SALARY ENAME EID


1 HR 308 58,000 Bobby 17
2 HR 308 45,000 Niya 38
1 Testing 305 45,000 Alice 18
2 Testing 305 35,000 Jerry 15
3 Testing 305 30,000 Reyon 16
1 Workshop 301 51,000 Bob 22
2 Workshop 301 50,000 Tom 24
3 Workshop 301 30,000 John 11
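One common use of ROW_NUMBER(), shown here only as an illustration that goes slightly beyond the query above, is to keep the top row of each partition by filtering on ROW_NUM = 1 in an outer query. The sketch uses Python's built-in sqlite3 module and assumes SQLite 3.25 or newer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE workers (ENAME TEXT, EID INTEGER, DEPTID INTEGER, "
    "DEPTNAME TEXT, SALARY INTEGER)"
)
conn.executemany(
    "INSERT INTO workers VALUES (?, ?, ?, ?, ?)",
    [
        ("Niya", 38, 308, "HR", 45000),
        ("Bobby", 17, 308, "HR", 58000),
        ("Reyon", 16, 305, "Testing", 30000),
        ("Jerry", 15, 305, "Testing", 35000),
        ("Alice", 18, 305, "Testing", 45000),
        ("John", 11, 301, "Workshop", 30000),
        ("Tom", 24, 301, "Workshop", 50000),
        ("Bob", 22, 301, "Workshop", 51000),
    ],
)

# Number the rows per department (highest salary first), then keep row 1.
top_earners = conn.execute(
    """
    SELECT DEPTNAME, ENAME, SALARY FROM (
        SELECT DEPTNAME, ENAME, SALARY,
               ROW_NUMBER() OVER (PARTITION BY DEPTNAME
                                  ORDER BY SALARY DESC) AS ROW_NUM
        FROM workers
    )
    WHERE ROW_NUM = 1
    ORDER BY DEPTNAME
    """
).fetchall()

print(top_earners)
```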

AVG()
• A window function applies a function across a set of table rows that are related to the current row.
• Unlike a GROUP BY aggregate, a window function does not collapse rows into a single output row; the rows keep their separate identities. The window function can access more than just the current row of the query result.
• To calculate the average value of each partition, we can use the window function AVG(). To calculate the average salary in each department, we can write the query as follows:
• SELECT AVG(SALARY) OVER (PARTITION BY DEPTNAME) AS AVG_SALARY, DEPTNAME, DEPTID, SALARY, ENAME, EID
FROM workers;
AVG_SALARY DEPTNAME DEPTID SALARY ENAME EID
51,500.0000 HR 308 45,000 Niya 38
51,500.0000 HR 308 58,000 Bobby 17
36,666.6667 Testing 305 35,000 Jerry 15
36,666.6667 Testing 305 45,000 Alice 18
36,666.6667 Testing 305 30,000 Reyon 16
43,666.6667 Workshop 301 30,000 John 11
43,666.6667 Workshop 301 50,000 Tom 24
43,666.6667 Workshop 301 51,000 Bob 22
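To make the point about rows keeping their separate identities concrete, the following sketch compares a GROUP BY average with the window version of AVG() on the same data. It uses Python's built-in sqlite3 module and assumes SQLite 3.25 or newer; the variable names are illustrative.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE workers (ENAME TEXT, EID INTEGER, DEPTID INTEGER, "
    "DEPTNAME TEXT, SALARY INTEGER)"
)
conn.executemany(
    "INSERT INTO workers VALUES (?, ?, ?, ?, ?)",
    [
        ("Niya", 38, 308, "HR", 45000),
        ("Bobby", 17, 308, "HR", 58000),
        ("Reyon", 16, 305, "Testing", 30000),
        ("Jerry", 15, 305, "Testing", 35000),
        ("Alice", 18, 305, "Testing", 45000),
        ("John", 11, 301, "Workshop", 30000),
        ("Tom", 24, 301, "Workshop", 50000),
        ("Bob", 22, 301, "Workshop", 51000),
    ],
)

# GROUP BY collapses each department into one output row.
grouped = conn.execute(
    "SELECT DEPTNAME, AVG(SALARY) FROM workers "
    "GROUP BY DEPTNAME ORDER BY DEPTNAME"
).fetchall()

# The window function keeps all eight rows and repeats the average on each.
windowed = conn.execute(
    "SELECT ENAME, DEPTNAME, AVG(SALARY) OVER (PARTITION BY DEPTNAME) "
    "FROM workers"
).fetchall()

print(len(grouped), "grouped rows;", len(windowed), "windowed rows")
```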
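• A running (cumulative) aggregate can also be calculated by adding an ORDER BY clause alongside PARTITION BY in the window definition of AVG(). With ORDER BY and no explicit frame, each row's average covers the rows from the start of the partition up to the current row (including any rows that tie with it on the ORDER BY value).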
• SELECT AVG(SALARY) OVER (PARTITION BY DEPTNAME ORDER BY SALARY DESC) AS AVG_SALARY,
DEPTNAME, DEPTID, SALARY, ENAME, EID FROM workers;

AVG_SALARY DEPTNAME DEPTID SALARY ENAME EID


58,000.0000 HR 308 58,000 Bobby 17
51,500.0000 HR 308 45,000 Niya 38
45,000.0000 Testing 305 45,000 Alice 18
40,000.0000 Testing 305 35,000 Jerry 15
36,666.6667 Testing 305 30,000 Reyon 16
51,000.0000 Workshop 301 51,000 Bob 22
50,500.0000 Workshop 301 50,000 Tom 24
43,666.6667 Workshop 301 30,000 John 11
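The averages in the table above grow row by row from the start of each partition. If a fixed-size moving average is wanted instead, an explicit frame can be added. The sketch below contrasts the two using Python's built-in sqlite3 module (assuming SQLite 3.25 or newer); the two-row frame is only an illustrative choice.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE workers (ENAME TEXT, EID INTEGER, DEPTID INTEGER, "
    "DEPTNAME TEXT, SALARY INTEGER)"
)
conn.executemany(
    "INSERT INTO workers VALUES (?, ?, ?, ?, ?)",
    [
        ("Niya", 38, 308, "HR", 45000),
        ("Bobby", 17, 308, "HR", 58000),
        ("Reyon", 16, 305, "Testing", 30000),
        ("Jerry", 15, 305, "Testing", 35000),
        ("Alice", 18, 305, "Testing", 45000),
        ("John", 11, 301, "Workshop", 30000),
        ("Tom", 24, 301, "Workshop", 50000),
        ("Bob", 22, 301, "Workshop", 51000),
    ],
)

rows = conn.execute(
    """
    SELECT ENAME,
           -- default frame: partition start up to the current row
           AVG(SALARY) OVER (PARTITION BY DEPTNAME ORDER BY SALARY DESC),
           -- explicit frame: previous row and current row only
           AVG(SALARY) OVER (PARTITION BY DEPTNAME ORDER BY SALARY DESC
                             ROWS BETWEEN 1 PRECEDING AND CURRENT ROW)
    FROM workers
    """
).fetchall()

running_avg = {name: run for name, run, _ in rows}
moving_avg = {name: mov for name, _, mov in rows}
print(running_avg["Reyon"], moving_avg["Reyon"])
```

For Reyon, the running average covers all three Testing salaries, while the two-row moving average covers only 35,000 and 30,000.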

SUM()
• The SUM() window function returns the sum of the input column or expression across the rows of each partition.
• For example, to calculate the sum of the salaries of workers in each department, we can write the query as follows:
• SELECT SUM(SALARY) OVER (PARTITION BY DEPTNAME) AS SUM_SALARY, DEPTNAME, DEPTID, SALARY,
ENAME, EID FROM workers;

SUM_SALARY DEPTNAME DEPTID SALARY ENAME EID


103,000 HR 308 45,000 Niya 38
103,000 HR 308 58,000 Bobby 17
110,000 Testing 305 35,000 Jerry 15
110,000 Testing 305 45,000 Alice 18
110,000 Testing 305 30,000 Reyon 16
131,000 Workshop 301 30,000 John 11
131,000 Workshop 301 50,000 Tom 24
131,000 Workshop 301 51,000 Bob 22
• If we want to calculate a running (cumulative) sum of salaries within each department, we can add an ORDER BY clause to the above query:
• SELECT SUM(SALARY) OVER (PARTITION BY DEPTNAME ORDER BY SALARY DESC) AS
SUM_SALARY, DEPTNAME, DEPTID, SALARY, ENAME, EID FROM workers;

SUM_SALARY DEPTNAME DEPTID SALARY ENAME EID


58,000 HR 308 58,000 Bobby 17
103,000 HR 308 45,000 Niya 38
45,000 Testing 305 45,000 Alice 18
80,000 Testing 305 35,000 Jerry 15
110,000 Testing 305 30,000 Reyon 16
51,000 Workshop 301 51,000 Bob 22
101,000 Workshop 301 50,000 Tom 24
131,000 Workshop 301 30,000 John 11

COUNT()
• The COUNT() window function counts the rows in each partition for which the given expression is not NULL. To count employees in each department, we can write the query as follows:
• SELECT COUNT(ENAME) OVER (PARTITION BY DEPTNAME) AS COUNT_ENAME,
DEPTNAME,DEPTID, SALARY, ENAME, EID FROM WORKERS;

COUNT_ENAME DEPTNAME DEPTID SALARY ENAME EID


2 HR 308 45,000 Niya 38
2 HR 308 58,000 Bobby 17
3 Testing 305 35,000 Jerry 15
3 Testing 305 45,000 Alice 18
3 Testing 305 30,000 Reyon 16
3 Workshop 301 30,000 John 11
3 Workshop 301 50,000 Tom 24
3 Workshop 301 51,000 Bob 22

MIN() and MAX()
• The aggregate window functions MIN() and MAX() return the minimum and maximum values
of an expression within a specified window.
• The following query will return the maximum and minimum salaries of workers in each
department.
• SELECT DEPTNAME, DEPTID, SALARY, ENAME, EID, MAX(SALARY) OVER (PARTITION BY
DEPTNAME) AS MAX_SAL, MIN(SALARY) OVER (PARTITION BY DEPTNAME) AS MIN_SAL FROM
workers;
DEPTNAME DEPTID SALARY ENAME EID MAX_SAL MIN_SAL
HR 308 45,000 Niya 38 58,000 45,000
HR 308 58,000 Bobby 17 58,000 45,000
Testing 305 35,000 Jerry 15 45,000 30,000
Testing 305 45,000 Alice 18 45,000 30,000
Testing 305 30,000 Reyon 16 45,000 30,000
Workshop 301 30,000 John 11 51,000 30,000
Workshop 301 50,000 Tom 24 51,000 30,000
Workshop 301 51,000 Bob 22 51,000 30,000
LEAD()
• The SQL LEAD() function provides access to a row at a specified physical offset that follows the current row.
• For example, by using the LEAD() function you can, from the current row, read data from the next row, the second row after the current row, the third row after it, and so on.
• The LEAD() function syntax is given below:
• LEAD(return_value [,offset[, default ]])
OVER (
PARTITION BY ...
ORDER BY ...
)
• In the above syntax, return_value is the expression evaluated on the row that lies offset rows after the current row. The offset is the number of rows forward from the current row from which to access data.
• The offset must be a nonnegative integer. If the offset is not specified, it defaults to 1.
• When the offset goes beyond the end of the partition, the function returns the default value. If no default is specified, NULL is returned.
• The LEAD() function is applied within the partitions created by the PARTITION BY clause. If the PARTITION BY clause is not used, the whole result set is treated as a single partition.
• The ORDER BY clause sorts the rows in each partition before LEAD() is applied. The following query extracts the salary of the next person in the department; if there is no next person in the partition, it returns NULL.
• SELECT ENAME, EID, DEPTID, DEPTNAME, SALARY, LEAD(SALARY) OVER (PARTITION BY DEPTNAME
ORDER BY SALARY) AS NEXT_PERSON_SALARY FROM workers;

ENAME EID DEPTID DEPTNAME SALARY NEXT_PERSON_SALARY


Niya 38 308 HR 45,000 58,000
Bobby 17 308 HR 58,000 NULL
Reyon 16 305 Testing 30,000 35,000
Jerry 15 305 Testing 35,000 45,000
Alice 18 305 Testing 45,000 NULL
John 11 301 Workshop 30,000 50,000
Tom 24 301 Workshop 50,000 51,000
Bob 22 301 Workshop 51,000 NULL
• The LEAD() function can also be very useful for calculating the difference between the value of the current row and the value of the following row.
• The following query finds the difference between each person's salary and the next higher salary in the same department.
• SELECT ENAME, EID, DEPTID, DEPTNAME, SALARY, LEAD(SALARY) OVER (PARTITION BY DEPTNAME
ORDER BY SALARY)-SALARY AS SALARY_DIFFERENCE FROM workers;

ENAME EID DEPTID DEPTNAME SALARY SALARY_DIFFERENCE


Niya 38 308 HR 45,000 13,000
Bobby 17 308 HR 58,000 NULL
Reyon 16 305 Testing 30,000 5,000
Jerry 15 305 Testing 35,000 10,000
Alice 18 305 Testing 45,000 NULL
John 11 301 Workshop 30,000 20,000
Tom 24 301 Workshop 50,000 1,000
Bob 22 301 Workshop 51,000 NULL
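The optional offset and default arguments from the LEAD() syntax can be seen in a short sketch using Python's built-in sqlite3 module (assuming SQLite 3.25 or newer). The offset of 2 and the default of 0 are illustrative choices, not values from the original queries.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE workers (ENAME TEXT, EID INTEGER, DEPTID INTEGER, "
    "DEPTNAME TEXT, SALARY INTEGER)"
)
conn.executemany(
    "INSERT INTO workers VALUES (?, ?, ?, ?, ?)",
    [
        ("Niya", 38, 308, "HR", 45000),
        ("Bobby", 17, 308, "HR", 58000),
        ("Reyon", 16, 305, "Testing", 30000),
        ("Jerry", 15, 305, "Testing", 35000),
        ("Alice", 18, 305, "Testing", 45000),
        ("John", 11, 301, "Workshop", 30000),
        ("Tom", 24, 301, "Workshop", 50000),
        ("Bob", 22, 301, "Workshop", 51000),
    ],
)

rows = conn.execute(
    """
    SELECT ENAME,
           -- offset 2, no default: NULL when two rows ahead do not exist
           LEAD(SALARY, 2) OVER (PARTITION BY DEPTNAME ORDER BY SALARY),
           -- offset 1, default 0: 0 instead of NULL on the last row
           LEAD(SALARY, 1, 0) OVER (PARTITION BY DEPTNAME ORDER BY SALARY)
    FROM workers
    """
).fetchall()

two_ahead = {name: value for name, value, _ in rows}
next_or_zero = {name: value for name, _, value in rows}
print(two_ahead)
print(next_or_zero)
```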

FIRST_VALUE()
• The SQL window function FIRST_VALUE() returns the first value in an ordered group of a result set or window frame.
• The following query returns the first salary value in each department when rows are ordered by salary in descending order, i.e. the highest salary in the department.
• SELECT FIRST_VALUE(SALARY) OVER (PARTITION BY DEPTNAME ORDER BY SALARY DESC) AS FIRST_ROW,
DEPTNAME, DEPTID, SALARY, ENAME, EID FROM workers;

FIRST_ROW DEPTNAME DEPTID SALARY ENAME EID


58,000 HR 308 58,000 Bobby 17
58,000 HR 308 45,000 Niya 38
45,000 Testing 305 45,000 Alice 18
45,000 Testing 305 35,000 Jerry 15
45,000 Testing 305 30,000 Reyon 16
51,000 Workshop 301 51,000 Bob 22
51,000 Workshop 301 50,000 Tom 24
51,000 Workshop 301 30,000 John 11

LAST_VALUE()
• The SQL window function LAST_VALUE() returns the last value in an ordered group of a result set or window frame.
• LAST_VALUE() as used in SQL Server is a window function that returns the last value in an ordered partition of the given data set.
• The following query asks for the last salary value in each department ordered by salary. Because no frame is specified, the default frame ends at the current row, so each row gets its own salary back rather than the lowest salary in the department.
• SELECT LAST_VALUE(SALARY) OVER (PARTITION BY DEPTNAME ORDER BY SALARY DESC) AS LAST_ROW,
DEPTNAME, DEPTID, SALARY, ENAME, EID FROM workers;

LAST_ROW DEPTNAME DEPTID SALARY ENAME EID
58,000 HR 308 58,000 Bobby 17
45,000 HR 308 45,000 Niya 38
45,000 Testing 305 45,000 Alice 18
35,000 Testing 305 35,000 Jerry 15
30,000 Testing 305 30,000 Reyon 16
51,000 Workshop 301 51,000 Bob 22
50,000 Workshop 301 50,000 Tom 24
30,000 Workshop 301 30,000 John 11
• FIRST_VALUE is the same for every row of a partition and equal to the value in the partition's first row.
• LAST_VALUE, by contrast, changes for each record and is equal to the last value that was pulled (i.e. the current value in the result set).
• The following is a sort of scoreboard where each person has their own set of points. To know where they stand, each row must have a low and high score associated with it.
Employee Scores Table
ID NAME SCORE
1011 Scott 2100
1012 Peter 2220
1013 John 2010
1014 George 2009
1015 Thomas 2500
1016 Veronica 2110
1017 Anthony 2011
• SELECT IdCol, vcName, iScore,
LAST_VALUE(iScore)
OVER (ORDER BY iScore DESC RANGE BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) as
LowestiScore,
FIRST_VALUE(iScore)
OVER (ORDER BY iScore DESC RANGE BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) as
HighestiScore
FROM tblEmpScores;
• If we want the last value to remain the same for all rows in the result set, we need to use ROWS (or RANGE) BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING with the LAST_VALUE function.
• UNBOUNDED PRECEDING means that the starting boundary is the first row in the partition, and UNBOUNDED FOLLOWING means that the ending boundary is the last row in the partition.
ID NAME SCORE LowestiScore HighestiScore
1011 Scott 2100 2009 2500
1012 Peter 2220 2009 2500
1013 John 2010 2009 2500
1014 George 2009 2009 2500
1015 Thomas 2500 2009 2500
1016 Veronica 2110 2009 2500
1017 Anthony 2011 2009 2500
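The effect of the frame clause on LAST_VALUE() can be checked with a small sketch using Python's built-in sqlite3 module (assuming SQLite 3.25 or newer). The table and column names follow the scoreboard query; the result names default_last and full_last are illustrative.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE tblEmpScores (IdCol INTEGER, vcName TEXT, iScore INTEGER)"
)
conn.executemany(
    "INSERT INTO tblEmpScores VALUES (?, ?, ?)",
    [
        (1011, "Scott", 2100),
        (1012, "Peter", 2220),
        (1013, "John", 2010),
        (1014, "George", 2009),
        (1015, "Thomas", 2500),
        (1016, "Veronica", 2110),
        (1017, "Anthony", 2011),
    ],
)

rows = conn.execute(
    """
    SELECT vcName, iScore,
           -- default frame ends at the current row
           LAST_VALUE(iScore) OVER (ORDER BY iScore DESC),
           -- full frame spans the whole partition
           LAST_VALUE(iScore) OVER (
               ORDER BY iScore DESC
               ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)
    FROM tblEmpScores
    """
).fetchall()

default_last = {name: last for name, _, last, _ in rows}
full_last = {name: last for name, _, _, last in rows}
scores = {name: score for name, score, _, _ in rows}
print(default_last)
print(full_last)
```

With the default frame each row simply sees its own score; only the full frame yields the lowest score, 2009, on every row.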

LAG()
• We can use the SQL window function LAG() to access a previous row's data based on a defined offset value. It works similarly to the LEAD() function.
• With the SQL LEAD() function we access the values of subsequent rows, whereas with the LAG() function we access a previous row's data.
• It is useful for comparing the current row's value with the previous row's value.
• SELECT ENAME, EID, DEPTID, DEPTNAME, SALARY, LAG(SALARY) OVER (PARTITION BY
DEPTNAME ORDER BY SALARY) AS PREVIOUS_PERSON_SALARY FROM workers;
• The above query finds the salary of the previous person in each department based on
logically sorted salary value.
• As no previous row is available for the first row in each department, it returns a NULL
value.

ENAME EID DEPTID DEPTNAME SALARY PREVIOUS_PERSON_SALARY
Niya 38 308 HR 45,000 NULL
Bobby 17 308 HR 58,000 45,000
Reyon 16 305 Testing 30,000 NULL
Jerry 15 305 Testing 35,000 30,000
Alice 18 305 Testing 45,000 35,000
John 11 301 Workshop 30,000 NULL
Tom 24 301 Workshop 50,000 30,000
Bob 22 301 Workshop 51,000 50,000

Preparing Data from Analytics Tool
• One of the primary steps performed for data science is the cleaning of the
dataset you are working with.
• Various SQL queries can be used to clean, update, and filter data, by eliminating
redundant and unwanted records. This can be done with the different SQL
clauses like CASE WHEN, COALESCE, NULLIF, LEAST/GREATEST, Casting, and
DISTINCT.
Sales Table
sale_no product_id quantity price customer_name
5,001 3 4 21,000 John
5,002 11 NULL 17,000 Anna
5,003 94 10 105,000 Tom
5,004 86 8 27,000 Nora
5,005 88 18 8,000 Tom

CASE WHEN
• The CASE statement goes through various conditions specified with WHEN clause and returns a
value when the first condition is met.
• It works like nested IF-THEN-ELSE statement. Once a condition is true, it will return the value
specified after THEN. Value in the ELSE clause is returned, if no conditions are true.
• It returns NULL when no conditions are true, and no ELSE part is specified in the query.
• Suppose we fetch all data from the above sales table and want to add an extra column, labeled summary, that categorizes sales into More, Less, and Avg. This can be done using a CASE expression as follows:
• SELECT *,
CASE
WHEN quantity >= 10 THEN 'More'
WHEN quantity >= 6 THEN 'Avg'
ELSE 'Less'
END AS summary
FROM sales;
• A NULL quantity satisfies neither comparison, so that row falls through to the ELSE branch and is labeled Less.
sale_no product_id quantity price customer_name summary


5,001 3 4 21,000 John Less
5,002 11 NULL 17,000 Anna Less
5,003 94 10 105,000 Tom More
5,004 86 8 27,000 Nora Avg
5,005 88 18 8,000 Tom More
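Sale 5,002 is labeled Less only because its NULL quantity fails both comparisons. The sketch below, using Python's built-in sqlite3 module, reproduces that result and then adds an explicit IS NULL branch; the 'Unknown' label is a hypothetical addition, not part of the original query.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales (sale_no INTEGER, product_id INTEGER, "
    "quantity INTEGER, price INTEGER, customer_name TEXT)"
)
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?, ?, ?)",
    [
        (5001, 3, 4, 21000, "John"),
        (5002, 11, None, 17000, "Anna"),
        (5003, 94, 10, 105000, "Tom"),
        (5004, 86, 8, 27000, "Nora"),
        (5005, 88, 18, 8000, "Tom"),
    ],
)

# Original CASE expression: NULL quantity falls through to ELSE.
original = dict(conn.execute(
    """
    SELECT sale_no,
           CASE
               WHEN quantity >= 10 THEN 'More'
               WHEN quantity >= 6 THEN 'Avg'
               ELSE 'Less'
           END
    FROM sales
    """
).fetchall())

# Variant with an explicit branch for missing quantities (hypothetical label).
with_unknown = dict(conn.execute(
    """
    SELECT sale_no,
           CASE
               WHEN quantity IS NULL THEN 'Unknown'
               WHEN quantity >= 10 THEN 'More'
               WHEN quantity >= 6 THEN 'Avg'
               ELSE 'Less'
           END
    FROM sales
    """
).fetchall())

print(original)
print(with_unknown)
```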

COALESCE
• Some records in a database may contain NULL values, but when applying statistics to such datasets you may need to replace these NULL values with some other data. This can be done effectively with the COALESCE function.
• COALESCE returns the first of its arguments that is not NULL. In the common two-argument form, the first argument is a column that may contain NULL, and the second is the value that replaces NULL.
• Every NULL in the column is thus replaced by the default value given as the second argument.
• The following example replaces NULL by -1 in the quantity column.
• SELECT
customer_name ,product_id,
COALESCE(quantity, -1) AS quantity
FROM sales;
customer_name product_id quantity
John 3 4
Anna 11 -1
Tom 94 10
Nora 86 8
Tom 88 18
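The COALESCE query can be run as a sketch with Python's built-in sqlite3 module. The comparison of the two averages is an added illustration: it suggests that a sentinel such as -1 changes any statistics computed on the column, so the replacement value is worth choosing with care.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales (sale_no INTEGER, product_id INTEGER, "
    "quantity INTEGER, price INTEGER, customer_name TEXT)"
)
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?, ?, ?)",
    [
        (5001, 3, 4, 21000, "John"),
        (5002, 11, None, 17000, "Anna"),
        (5003, 94, 10, 105000, "Tom"),
        (5004, 86, 8, 27000, "Nora"),
        (5005, 88, 18, 8000, "Tom"),
    ],
)

# Replace NULL quantities by -1, as in the query above.
quantities = [
    q for (q,) in conn.execute(
        "SELECT COALESCE(quantity, -1) FROM sales ORDER BY sale_no"
    )
]

# AVG() skips NULLs, but it does not skip the -1 sentinel.
avg_ignoring_null, avg_with_sentinel = conn.execute(
    "SELECT AVG(quantity), AVG(COALESCE(quantity, -1)) FROM sales"
).fetchone()

# COALESCE accepts more than two arguments: first non-NULL wins.
(first_non_null,) = conn.execute("SELECT COALESCE(NULL, NULL, 7)").fetchone()

print(quantities, avg_ignoring_null, avg_with_sentinel, first_non_null)
```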

NULLIF
• The NULLIF function takes two parameters and returns NULL if the first parameter equals the second; otherwise it returns the first parameter.
• As an example, imagine that we want to replace the product_id value 11 by NULL. This could be done with the following query:
• SELECT sale_no, customer_name,
NULLIF(product_id, 11) AS product_id
FROM sales;

sale_no customer_name product_id
5,001 John 3
5,002 Anna NULL
5,003 Tom 94
5,004 Nora 86
5,005 Tom 88

LEAST/GREATEST
• LEAST and GREATEST are frequently used functions for data cleaning.
• These functions return the smallest and largest values, respectively, from the given list of arguments. They are useful for capping a value that is too high or too low.
• For example, suppose the minimum price needs to be 10,000 in the above table. This can be done with the following query.
• SELECT
sale_no, product_id, quantity,
GREATEST(10000, price) as price
FROM sales;
• The price 8,000 in the last row is replaced by 10,000: since 8,000 is less than 10,000, GREATEST returns the larger of the two values.
GREATEST
sale_no product_id quantity price
5,001 3 4 21,000
5,002 11 NULL 17,000
5,003 94 10 105,000
5,004 86 8 27,000
5,005 88 18 10,000

LEAST
• SELECT
sale_no, product_id, quantity,
LEAST(10000, price) as price
FROM sales;
sale_no product_id quantity price
5,001 3 4 10,000
5,002 11 NULL 10,000
5,003 94 10 10,000
5,004 86 8 10,000
5,005 88 18 8,000
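SQLite does not provide GREATEST and LEAST under those names, but its scalar MAX() and MIN() with two or more arguments behave the same way for non-NULL inputs. The sketch below therefore uses them as stand-ins via Python's built-in sqlite3 module. The upper bound of 100,000 in the clamped column is a hypothetical value chosen for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales (sale_no INTEGER, product_id INTEGER, "
    "quantity INTEGER, price INTEGER, customer_name TEXT)"
)
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?, ?, ?)",
    [
        (5001, 3, 4, 21000, "John"),
        (5002, 11, None, 17000, "Anna"),
        (5003, 94, 10, 105000, "Tom"),
        (5004, 86, 8, 27000, "Nora"),
        (5005, 88, 18, 8000, "Tom"),
    ],
)

rows = conn.execute(
    """
    SELECT sale_no,
           MAX(10000, price),                  -- like GREATEST(10000, price)
           MIN(10000, price),                  -- like LEAST(10000, price)
           MIN(MAX(price, 10000), 100000)      -- clamp into [10000, 100000]
    FROM sales
    ORDER BY sale_no
    """
).fetchall()

floor_prices = [r[1] for r in rows]
capped_prices = [r[2] for r in rows]
clamped_prices = [r[3] for r in rows]
print(floor_prices, capped_prices, clamped_prices)
```

Nesting the two functions, as in the last column, is a common way to force a value into a range from both sides.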

DISTINCT
• The DISTINCT keyword returns only the distinct values of the specified column or columns.
• For example, to extract all the unique names in the sales table, you could write the following query:
• SELECT
DISTINCT customer_name
FROM sales;
• DISTINCT can also be applied to multiple columns to get the distinct combinations of the specified columns.
• The single-column query above gives the following result: duplicate names are removed from the customer_name column.

customer_name
John
Anna
Tom
Nora
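The difference between DISTINCT on one column and on several columns can be seen in a short sketch with Python's built-in sqlite3 module. The ORDER BY clauses are additions that only make the printed output stable.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales (sale_no INTEGER, product_id INTEGER, "
    "quantity INTEGER, price INTEGER, customer_name TEXT)"
)
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?, ?, ?)",
    [
        (5001, 3, 4, 21000, "John"),
        (5002, 11, None, 17000, "Anna"),
        (5003, 94, 10, 105000, "Tom"),
        (5004, 86, 8, 27000, "Nora"),
        (5005, 88, 18, 8000, "Tom"),
    ],
)

# One column: the two rows for Tom collapse into one.
names = [
    name for (name,) in conn.execute(
        "SELECT DISTINCT customer_name FROM sales ORDER BY customer_name"
    )
]

# Two columns: (Tom, 94) and (Tom, 88) are different combinations.
pairs = conn.execute(
    "SELECT DISTINCT customer_name, product_id FROM sales "
    "ORDER BY customer_name, product_id"
).fetchall()

print(names)
print(pairs)
```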

Advanced NoSQL for Data Science
• NoSQL, which means “not only SQL”, is an alternative to relational databases, in which data is stored in tables with a fixed schema.
• NoSQL is a database design approach that can accommodate various data models, including key-value, document, columnar, and graph formats.
• NoSQL databases are very useful for working with large distributed data.
• NoSQL databases were built in the early 2000s to deal with large-scale database clustering in web and cloud applications.
• NoSQL has a flexible schema, unlike the traditional relational database model. Rows can have different structures or attributes.
• NoSQL databases are very useful for handling big data tasks because they follow the Basically Available, Soft State, Eventual Consistency (BASE) approach instead of Atomicity, Consistency, Isolation, and Durability, commonly known as the ACID properties.
• Two major drawbacks of relational (SQL) databases are rigidity when adding columns and attributes to tables, and slow performance when many tables need to be joined or when tables store a large amount of data.
• NoSQL databases try to overcome these two drawbacks of relational databases.
• NoSQL offers a more flexible, schema-free solution that can work with unstructured data.
Why NoSQL?
• NoSQL supports unstructured or semi-structured data.
• In many applications, an attribute needs to be added on the fly for specific rows, but not every row, and may be of a different type than the attributes already in those rows.
• Now let us explore some NoSQL features to understand why you might choose NoSQL databases for data science.
• Features:
• It does not use the relational model to store data.
• It runs well on clusters.
• It is mostly open-source.
• It is capable of handling large amounts of social media data.
• It is schema-less.

Document Databases for Data Science
• Document-based NoSQL databases store data as JSON objects. Each document has a structure made of key-value pairs.
• Document-based NoSQL databases are simple for engineers to work with because items map directly to JSON objects.
• JSON is a very common data format, widely adopted by web developers, and it permits us to change the structure whenever required.
• Some examples of document-based NoSQL databases are CouchDB, MongoDB, OrientDB, and BaseX.

JSON Document Format
{
  "_id": 1,
  "name": { "first": "John", "last": "Backus" },
  "contribs": [ "Fortran", "ALGOL", "Form", "FP" ],
  "awards": [
    {
      "award": "Dowell Award",
      "year": 1988,
      "by": "Computer Society"
    },
    {
      "award": "First Prize",
      "year": 1993,
      "by": "National Academy of Engineering"
    }
  ]
}
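A document like the one above maps directly onto nested dictionaries and lists in most languages. The sketch below parses it with Python's standard json module and adds a field to this one document only, which is the kind of per-document schema change described earlier. The email field and its value are hypothetical.

```python
import json

doc_text = """
{
  "_id": 1,
  "name": { "first": "John", "last": "Backus" },
  "contribs": [ "Fortran", "ALGOL", "Form", "FP" ],
  "awards": [
    { "award": "Dowell Award", "year": 1988, "by": "Computer Society" },
    { "award": "First Prize", "year": 1993,
      "by": "National Academy of Engineering" }
  ]
}
"""

doc = json.loads(doc_text)

# Nested objects become dicts, arrays become lists.
first_name = doc["name"]["first"]
award_years = [award["year"] for award in doc["awards"]]

# Add an attribute to this document alone; no schema change is needed.
doc["email"] = "john@example.com"  # hypothetical field for illustration

print(first_name, award_years)
print(json.dumps(doc, indent=2))
```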
Wide Column Databases for Data Science
• Like a relational database, a wide-column database stores data in records (rows), but each row can also hold a very large number of dynamic columns.
• It groups the dynamically added columns into column families.
• Instead of having multiple tables as in relational databases, we have multiple column families in wide-column databases.
• Examples of wide-column databases are Cassandra and HBase.

Pattern for wide-column database.
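As a loose mental model only, and not how Cassandra or HBase are implemented, a wide-column table can be pictured as a nested mapping from row key to column family to column. In the sketch below, every row key, family name, and value is hypothetical.

```python
from collections import defaultdict

# table[row_key][column_family][column] = value
table = defaultdict(lambda: defaultdict(dict))


def put(row_key, family, column, value):
    """Store one cell; columns are created on the fly per row."""
    table[row_key][family][column] = value


def get(row_key, family, column, default=None):
    """Read one cell, returning default when the row lacks that column."""
    return table.get(row_key, {}).get(family, {}).get(column, default)


# Two rows with different sets of columns and column families.
put("user:1", "profile", "name", "John")
put("user:1", "profile", "city", "Pune")
put("user:2", "profile", "name", "Anna")
put("user:2", "activity", "last_login", "2025-01-25")

print(get("user:1", "profile", "name"))
print(get("user:1", "activity", "last_login"))  # missing column for this row
print(sorted(table["user:2"].keys()))
```

Each row carries only the columns it actually uses, which is what allows very large numbers of sparse, dynamic columns.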

Graph Databases for Data Science
• A graph database stores data in the form of nodes and edges.
• Nodes store information about the main entities, such as people, places, and products, and edges store the relationships between them.
• Graph databases are very useful for finding patterns or relationships among data, as in social networks and recommendation engines.
• Examples of graph databases are Neo4j and Amazon Neptune.

Simple pattern for graph database.
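The node-and-edge model can be sketched in a few lines of plain Python, here as an undirected friendship graph with a simple friend-of-friend suggestion of the kind a recommendation engine might start from. This is only an in-memory illustration, not the API of Neo4j or Amazon Neptune, and the names and edges are made up.

```python
# Edges are relationships between person nodes (hypothetical data).
edges = [
    ("Alice", "Bob"),
    ("Bob", "Tom"),
    ("Alice", "Jerry"),
    ("Jerry", "Niya"),
]

# Build an adjacency map: node -> set of directly connected nodes.
graph = {}
for a, b in edges:
    graph.setdefault(a, set()).add(b)
    graph.setdefault(b, set()).add(a)


def suggest_friends(person):
    """Return friends of friends who are not already direct friends."""
    direct = graph.get(person, set())
    candidates = set()
    for friend in direct:
        candidates |= graph[friend]
    return sorted(candidates - direct - {person})


print(suggest_friends("Alice"))
print(suggest_friends("Tom"))
```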

