SQL DM1

The document discusses methods for predicting email communication using feature engineering and SQL techniques. It covers SQL concepts such as Common Table Expressions, subqueries, and the order of execution for SQL queries, along with practical examples. Additionally, it introduces big data technologies like PostgreSQL, Hadoop, data warehouses, and HDFS, highlighting their functionalities and differences.

Uploaded by

Sania Solad

Case study

How would you predict who a user may want to send a Snapchat or
Gmail message to?

For each candidate recipient, assign a score for how likely the user is to
send them an email. The rest is feature engineering:
-Number of past emails exchanged
-Number of responses
-Time since they last exchanged an email
-Whether the last email ends with a question mark
-Features about the other users, etc.
-The people the user emailed most often in the past, weighted by time
decay
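
The time-decay feature can be sketched as follows. This is only an illustration, not a production model: the function name `recipient_score`, the 30-day half-life, and the sample history are all assumptions made for the example.

```python
import math
from datetime import datetime


def recipient_score(sent_times, now, half_life_days=30.0):
    """Sum one exponentially decayed weight per past email.

    An email sent just now contributes 1.0; one sent `half_life_days`
    ago contributes 0.5. The half-life value is an assumed parameter.
    """
    decay = math.log(2) / half_life_days
    return sum(
        math.exp(-decay * (now - t).total_seconds() / 86400)
        for t in sent_times
    )


now = datetime(2024, 1, 31)
# Hypothetical send history: recipient -> timestamps of emails sent to them.
history = {
    "alice": [datetime(2024, 1, 30), datetime(2024, 1, 1)],
    "bob": [datetime(2023, 6, 1), datetime(2023, 6, 2), datetime(2023, 6, 3)],
}

scores = {name: recipient_score(times, now) for name, times in history.items()}
ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked)
```

Here bob has more past emails, but they are old, so alice ranks first. In practice this score would be one feature among the others listed above.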

SQL

1.Common Table Expressions (CTEs):

A CTE is a named temporary result set that you can reference within a
SELECT, INSERT, UPDATE, or DELETE statement.
CTEs make complex queries more readable and maintainable.
Use CTEs when you need to break down a complex query into smaller, more
understandable parts.
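
A minimal sketch of a CTE, run through Python's built-in sqlite3 module so that it is self-contained. The `emails` table and its rows are made up for the example; the `WITH ... AS (...)` syntax itself is standard SQL and also works in PostgreSQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE emails (sender TEXT, recipient TEXT);
    INSERT INTO emails VALUES
        ('ann', 'bob'), ('ann', 'bob'), ('ann', 'cat'), ('bob', 'ann');
""")

query = """
WITH pair_counts AS (      -- the CTE: number of emails per (sender, recipient)
    SELECT sender, recipient, COUNT(*) AS n
    FROM emails
    GROUP BY sender, recipient
)
SELECT sender, recipient, n
FROM pair_counts           -- the CTE is referenced like an ordinary table
WHERE n > 1
"""
rows = conn.execute(query).fetchall()
print(rows)
```

The aggregation gets a name (`pair_counts`), and the outer query reads as a simple filter on it.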

2.Subqueries vs. JOINs:

-Subqueries are nested queries within another query and are used to retrieve
data for further processing.
-JOINs combine rows from two or more tables based on a related column.
-Use subqueries when you need to retrieve a single value or a small set of
values, and use JOINs when you need to combine data from multiple tables.
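
The difference can be illustrated with a small sketch using Python's sqlite3 module. The `student` and `enrollment` tables and their contents are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE student (id INTEGER PRIMARY KEY, name TEXT, score INTEGER);
    CREATE TABLE enrollment (student_id INTEGER, course TEXT);
    INSERT INTO student VALUES (1, 'Asha', 90), (2, 'Ravi', 60), (3, 'Meera', 75);
    INSERT INTO enrollment VALUES (1, 'SQL'), (2, 'SQL'), (3, 'Hadoop');
""")

# Subquery: the inner query returns a single value (the average score).
above_avg = conn.execute("""
    SELECT name
    FROM student
    WHERE score > (SELECT AVG(score) FROM student)
    ORDER BY name
""").fetchall()

# JOIN: rows from two tables are combined on the related column.
sql_students = conn.execute("""
    SELECT s.name, e.course
    FROM student AS s
    JOIN enrollment AS e ON e.student_id = s.id
    WHERE e.course = 'SQL'
    ORDER BY s.name
""").fetchall()

print(above_avg)
print(sql_students)
```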

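3.In what order are the clauses of a SQL query executed at the back end?

Ans.
Logical order of execution for an SQL query (the optimizer may still choose
a different physical plan, but the result must match this order):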
1) FROM, including JOINs
2) WHERE
3) GROUP BY
4) HAVING
5) WINDOW Functions
6) SELECT
7) DISTINCT
8) UNION
9) ORDER BY
10) LIMIT AND OFFSET
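
The order above can be traced on a small query. This is a sketch run on SQLite through Python's sqlite3 module; the `orders` table and its values are invented for the example, and the numbered comments refer to the steps in the list.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (customer TEXT, amount INTEGER);
    INSERT INTO orders VALUES
        ('ann', 55), ('ann', 9), ('bob', 30), ('bob', 40), ('cat', 8), ('cat', 200);
""")

rows = conn.execute("""
    SELECT customer, SUM(amount) AS total   -- 6) compute the output columns
    FROM orders                             -- 1) read the source rows
    WHERE amount >= 10                      -- 2) drop individual rows first
    GROUP BY customer                       -- 3) form groups from what is left
    HAVING SUM(amount) > 60                 -- 4) drop whole groups
    ORDER BY total DESC                     -- 9) sort the result
    LIMIT 2                                 -- 10) keep the first rows
""").fetchall()
print(rows)
```

Because WHERE runs before GROUP BY, ann's 9 is discarded before aggregation, so her total is 55 and the group fails the HAVING test, even though her unfiltered total would have been 64.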

4.Write a SQL query to find all the students named Nitin in a table

select name
from student
where lower(name) like '%nitin%'

The trick is to convert the whole name column to lowercase before comparing.

Wrong approach:

name like '%nitin%'

In a case-sensitive database such as PostgreSQL, this will not capture Nitin,
niTin, etc.

5.Write a query to get all the students whose name has length 10, starts with K
and ends with z.

select name
from student
where length(name) = 10 and lower(name) like 'k%z'
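
Both name-matching queries can be tried out with Python's sqlite3 module, as sketched below. The sample names are invented. Note that SQLite's LIKE is already case-insensitive for ASCII letters, so this only demonstrates the normalized `lower(name)` form; the case-sensitivity problem shows up in databases such as PostgreSQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE student (name TEXT);
    INSERT INTO student VALUES
        ('Nitin'), ('niTin Rao'), ('Kamalnaraz'), ('kartikeyaZ'), ('Kriz'), ('Anita');
""")

# Question 4: every spelling of "nitin", regardless of letter case.
nitins = conn.execute("""
    SELECT name FROM student
    WHERE lower(name) LIKE '%nitin%'
    ORDER BY name
""").fetchall()

# Question 5: exactly 10 characters, starting with k/K and ending with z/Z.
k_to_z = conn.execute("""
    SELECT name FROM student
    WHERE length(name) = 10 AND lower(name) LIKE 'k%z'
    ORDER BY name
""").fetchall()

print(nitins)
print(k_to_z)
```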

7.ACID Properties

ACID stands for Atomicity, Consistency, Isolation, and Durability. These
properties ensure the reliability of database transactions.
Atomicity ensures that a transaction is treated as a single, indivisible unit:
either all of its operations take effect or none do.
Consistency guarantees that a transaction brings the database from one
consistent state to another.
Isolation ensures that concurrent transactions do not interfere with one
another.
Durability guarantees that once a transaction is committed, its effects are
permanent, even after a crash.
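
Atomicity and consistency can be seen in a small sketch using Python's sqlite3 module. The `account` table, the `transfer` helper, and the balances are hypothetical; the CHECK constraint stands in for a business rule that balances may not go negative.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE account (name TEXT PRIMARY KEY, "
    "balance INTEGER CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO account VALUES (?, ?)", [("ann", 100), ("bob", 20)])
conn.commit()


def transfer(conn, src, dst, amount):
    """Move money between two accounts as one all-or-nothing transaction."""
    try:
        with conn:  # commits on success, rolls back if an exception is raised
            conn.execute(
                "UPDATE account SET balance = balance + ? WHERE name = ?",
                (amount, dst),
            )
            conn.execute(
                "UPDATE account SET balance = balance - ? WHERE name = ?",
                (amount, src),
            )
        return True
    except sqlite3.IntegrityError:
        return False


ok = transfer(conn, "ann", "bob", 30)       # both updates are committed
failed = transfer(conn, "ann", "bob", 500)  # debit violates the CHECK constraint
balances = dict(conn.execute("SELECT name, balance FROM account"))
print(ok, failed, balances)
```

In the failed transfer the credit to bob had already been applied when the debit was rejected, yet the rollback undoes it, so no partial transfer is ever visible.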

1. Question: Convert '2023-10-15' to '15-Oct-2023'.

Answer: You can use the TO_CHAR function to format the date:

SELECT TO_CHAR(TO_DATE('2023-10-15', 'YYYY-MM-DD'), 'DD-Mon-YYYY') AS formatted_date;

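1. Calculate the date that is 90 days from today.

Answer: Use CURRENT_DATE and INTERVAL for date arithmetic:

SELECT CURRENT_DATE + INTERVAL '90 days' AS future_date;

In PostgreSQL, adding an INTERVAL to a date yields a timestamp; CURRENT_DATE + 90
keeps the result a date.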

1. Determine the day of the week for '2023-11-20'.

Answer: Use the TO_CHAR function to extract the day of the week:

SELECT TO_CHAR(TO_DATE('2023-11-20', 'YYYY-MM-DD'), 'Day') AS day_of_week;
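
The expected results of the three date questions above can be cross-checked outside the database. The sketch below uses Python's standard datetime module, not PostgreSQL, and fixes the starting date to 2023-10-15 for the 90-day calculation so the result is reproducible (that fixed date is an assumption; the SQL uses today's date).

```python
from datetime import date, timedelta

# '2023-10-15' -> '15-Oct-2023' (month name assumes the default C/English locale)
formatted = date.fromisoformat("2023-10-15").strftime("%d-%b-%Y")

# 90 days after a fixed example date
future = date(2023, 10, 15) + timedelta(days=90)

# Day of the week for '2023-11-20' (again assuming the default locale)
day_name = date.fromisoformat("2023-11-20").strftime("%A")

print(formatted, future.isoformat(), day_name)
```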

1. Display 'N/A' for employees with no 'hire_date'.

Answer: Use the COALESCE function to provide a default value for NULL dates.

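
A sketch of the COALESCE answer, run on SQLite through Python's sqlite3 module with a made-up `employee` table. SQLite stores the date as text here, so the query works as written; in PostgreSQL with a real DATE column, the date would first need converting to text, for example COALESCE(TO_CHAR(hire_date, 'YYYY-MM-DD'), 'N/A'), because all COALESCE arguments must share a type.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE employee (name TEXT, hire_date TEXT);
    INSERT INTO employee VALUES ('Asha', '2021-03-01'), ('Ravi', NULL);
""")

rows = conn.execute("""
    SELECT name, COALESCE(hire_date, 'N/A') AS hire_date
    FROM employee
    ORDER BY name
""").fetchall()
print(rows)
```
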
2. Convert a timestamp from one time zone to another.

Answer: Use the AT TIME ZONE clause to perform the conversion:

SELECT timestamp_column AT TIME ZONE 'UTC' AT TIME ZONE 'America/New_York' AS converted_timestamp FROM table_name;

Big Data Technologies

1.What is PostgreSQL?

PostgreSQL is an enterprise-level, versatile, resilient, open-source,
object-relational database management system that supports variable
workloads and concurrent users. It has been consistently backed by an
international developer community and has become popular among developers
because of its fault-tolerant characteristics.
It is a very reliable database management system, with more than two
decades of community work to thank for its high levels of resiliency, integrity,
and accuracy. Many online, mobile, geospatial, and analytics applications
use PostgreSQL as their primary data store or data warehouse.

2.What is Hadoop used for?

Apache Hadoop is an open-source framework that is used to efficiently store
and process large datasets ranging in size from gigabytes to petabytes of
data. Instead of using one large computer to store and process the data,
Hadoop allows clustering multiple computers to analyze massive datasets in
parallel more quickly.

3.What is Data Warehouse?

A data warehouse is a type of data management system that is designed to
enable and support business intelligence (BI) activities, especially analytics.
Data warehouses are solely intended to perform queries and analysis and
often contain large amounts of historical data. The data within a data
warehouse is usually derived from a wide range of sources such as
application log files and transaction applications.

4.What is the difference between Hive and Presto?

Hive is optimized for query throughput, while Presto is optimized for latency.
Presto limits the amount of memory that each task in a query can use, so a
query that requires more memory than that limit simply fails.

5.Define HDFS

HDFS stands for Hadoop Distributed File System. It is the primary data
storage system used by Hadoop applications. HDFS employs a NameNode and
DataNode architecture to implement a distributed file system that provides
high-performance access to data across highly scalable Hadoop clusters.
With HDFS, data is written on the server once, and read and reused
numerous times after that. HDFS has a primary NameNode, which keeps
track of where file data is kept in the cluster.
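
The NameNode/DataNode split can be pictured with a toy in-memory model. This is not the HDFS API: the class `ToyNameNode`, the 4-character block size, and the replication factor of 2 are all assumptions chosen to keep the example small (real HDFS blocks are far larger).

```python
from itertools import cycle


class ToyNameNode:
    """Keeps only metadata: which DataNodes hold each block of each file."""

    def __init__(self, datanodes, block_size=4, replication=2):
        self.datanodes = list(datanodes)
        self.block_size = block_size
        self.replication = replication
        self.block_map = {}  # filename -> [(block_id, [datanode, ...]), ...]
        # Stand-in for the DataNodes' disks: datanode -> {block_id: data}
        self.storage = {dn: {} for dn in self.datanodes}
        self._next = cycle(range(len(self.datanodes)))

    def write(self, filename, data):
        """Write once: split into blocks and replicate each across DataNodes."""
        if filename in self.block_map:
            raise FileExistsError(filename)
        blocks = []
        for i in range(0, len(data), self.block_size):
            block_id = f"{filename}#{i // self.block_size}"
            start = next(self._next)
            holders = [
                self.datanodes[(start + r) % len(self.datanodes)]
                for r in range(self.replication)
            ]
            for dn in holders:
                self.storage[dn][block_id] = data[i:i + self.block_size]
            blocks.append((block_id, holders))
        self.block_map[filename] = blocks

    def read(self, filename):
        """Read many times: look up block locations, then fetch each block."""
        return "".join(
            self.storage[holders[0]][block_id]
            for block_id, holders in self.block_map[filename]
        )


nn = ToyNameNode(["dn1", "dn2", "dn3"])
nn.write("log.txt", "hello hdfs!")
print(nn.block_map["log.txt"])
print(nn.read("log.txt"))
```

The NameNode object never returns data from its own metadata; it only answers where the blocks live, which mirrors the role described above.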