Databricks - Data Analyst
Databricks SQL UI caching: per-user caching of all query and dashboard results in
the Databricks SQL UI.
During Public Preview, the default behavior is that query results are cached
indefinitely and stored in the Databricks filesystem in your account. To delete
the stored results of a query you no longer want kept, re-run the query; once
re-run, the old query results are removed from the cache.
Query results caching: per-cluster caching of query results for all queries
run through SQL warehouses.
To disable query result caching, run SET use_cached_result = false in the
SQL editor.
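A minimal sketch of toggling the result cache in a session (the table name is illustrative):

```sql
-- Disable result reuse so the next query is executed fresh on the warehouse
SET use_cached_result = false;

SELECT count(*) FROM sales;   -- not served from the result cache

-- Re-enable caching for subsequent queries
SET use_cached_result = true;
```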
If the dashboard previously had an owner, that user no longer has the Can
Manage permission on the dashboard; the user you granted the Can Manage
permission is now the owner.
○ [ LEFT ] SEMI
Returns values from the left side of the relation that have a match with
the right. It is also referred to as a left semi join.
○ [ LEFT ] ANTI
Returns values from the left relation that have no match with the right.
It is also referred to as a left anti join.
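A sketch of both join types, assuming hypothetical customers and orders tables (only columns from the left relation can be selected):

```sql
-- LEFT SEMI JOIN: customers that have at least one matching order
SELECT c.id, c.name
FROM customers c
LEFT SEMI JOIN orders o ON c.id = o.customer_id;

-- LEFT ANTI JOIN: customers with no matching order
SELECT c.id, c.name
FROM customers c
LEFT ANTI JOIN orders o ON c.id = o.customer_id;
```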
SQL lets users work with data at the logical level. It is used for a variety of
tasks, such as querying data, controlling access to the database and its
objects, guaranteeing database consistency, updating rows in a table, and
creating, replacing, altering, and dropping objects.
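One illustrative statement per task category above (all object names are hypothetical):

```sql
CREATE TABLE staff (id INT, name STRING);        -- creating objects
UPDATE staff SET name = 'Ana' WHERE id = 1;      -- updating rows in a table
GRANT SELECT ON TABLE staff TO `analysts`;       -- controlling access
SELECT * FROM staff;                             -- querying data
DROP TABLE staff;                                -- dropping objects
```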
Q: What is the dashboard refresh interval?
A: 1 minute – 1 week by default.

Q: Dashboards do not support which of the following options (under Edit
widgets → Advanced)?
1. Borders
2. Customize tooltips
3. Customize labels
https://docs.databricks.com/sql/user/queries/query-parameters.html
Q: Who uses Databricks SQL as a secondary tool?
1. Business intelligence analyst
2. Business analyst
3. Data analyst
4. Data engineer

Q: A query is scheduled at a 4-hour interval, but the endpoint is taking
time to start. What should be done to manage costs?
1. Increase the cluster size
2. Decrease the cluster size
1. Use larger clusters. It may sound obvious, but this is the number one
problem we see. It’s actually not any more expensive to use a large cluster
for a workload than it is to use a smaller one. It’s just faster. If there’s
anything you should take away from this article, it’s this. Read section 1.
Really.
2. Use Photon, Databricks’ new, super-fast execution engine. Read section 2
to learn more. You won’t regret it.
3. Clean out your configurations. Configurations carried from one Apache
Spark™ version to the next can cause massive problems. Clean up! Read
section 3 to learn more.
4. Use Delta Caching. There’s a good chance you’re not using caching
correctly, if at all. See Section 4 to learn more.
5. Be aware of lazy evaluation. If this doesn’t mean anything to you and
you’re writing Spark code, jump to section 5.
6. Bonus tip! Table design is super important. We’ll go into this in a future
blog, but for now, check out the guide on Delta Lake best practices.
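Tip 4's Delta (disk) cache can be enabled and pre-warmed from SQL; the setting and table below are illustrative, and on some worker types the cache is already on by default:

```sql
-- Enable the disk cache for the current session
SET spark.databricks.io.cache.enabled = true;

-- Pre-load the cache with the columns a dashboard will repeatedly scan
CACHE SELECT id, ts, amount FROM sales;
```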
Q: A dashboard refreshes every minute from a streaming dataset. What should
the analyst raise as a concern?
Options:
1. Streaming datasets don't support fault tolerance
2. It will be costly
3.

Q: INSERT INTO syntax — which statements apply?
1. Wrong syntax (the syntax was correct)
2. Appends the data, including duplicates
INSERT { OVERWRITE | INTO } [ TABLE ] table_name
[ PARTITION clause ]
[ ( column_name [, ...] ) ]
query
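A minimal sketch of the two insert modes against a hypothetical events table:

```sql
-- INSERT INTO appends rows, including any duplicates of existing data
INSERT INTO events (id, ts)
SELECT id, ts FROM staging_events;

-- INSERT OVERWRITE replaces the table's existing rows with the query result
INSERT OVERWRITE TABLE events
SELECT id, ts FROM staging_events;
```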