MSc Data Science Unit 1
Data science
Data science is an interdisciplinary field that incorporates data engineering, information science, computer science, statistics, artificial intelligence, machine learning, data mining, and predictive analytics.
Data analytics is the science of fact-finding analysis of raw data, with the goal of drawing conclusions
from the data lake.
Machine learning is the capability of systems to learn without being explicitly programmed. It evolved from the study of pattern recognition and computational learning theory.
UNIT 1
Data Science Technology Stack
Describe data science technology stack
The Data Science Technology Stack covers the data processing requirements in the Rapid Information
Factory ecosystem.
The Rapid Information Factory ecosystem is a convention of techniques used for processing
developments.
Data Science Storage Tools
This data science ecosystem has a series of tools that you use to build your solutions
Schema-on-Write and Schema-on-Read
There are two basic methodologies that are supported by the data processing tools.
Schema-on-Write Ecosystems
A traditional relational database management system (RDBMS) requires a schema before you can
load the data. To retrieve data from these structured data schemas, you may have been running standard SQL queries for a number of years. Benefits include the following:
• In traditional data ecosystems, tools assume schemas and can only work once the schema is
described, so there is only one view on the data.
• The approach is extremely valuable in articulating relationships between data points, so there are
already relationships configured.
• It is an efficient way to store “dense” data.
• All the data is in the same data store.
On the other hand, schema-on-write isn’t the answer to every data science problem. Among the
downsides of this approach are that
• Its schemas are typically purpose-built, which makes them hard to change and maintain.
• It generally loses the raw/atomic data as a source for future analysis.
• It requires considerable modeling/implementation effort before being able to work with the data.
• If a specific type of data can’t be stored in the schema, you can’t effectively process it from the schema.
At present, schema-on-write is a widely adopted methodology to store data.
Schema-on-Read Ecosystems
This alternative data storage methodology does not require a schema before you can load the data.
Fundamentally, you store the data with minimum structure. The essential schema is applied during the
query phase.
Benefits include the following:
• It provides flexibility to store unstructured, semi-structured, and disorganized data.
• It allows for unlimited flexibility when querying data from the structure.
• Leaf-level data is kept intact and untransformed for reference and use for the future.
• The methodology encourages experimentation and exploration.
• It increases the speed of generating fresh actionable knowledge.
• It reduces the cycle time between data generation and the availability of actionable knowledge.
A hybrid between schema-on-read and schema-on-write ecosystems is recommended for effective data science and engineering.
Data Lake
A data lake is a storage repository for a massive amount of raw data. It stores data in native format, in
anticipation of future requirements. While a schema-on-write data warehouse stores data in predefined
databases, tables, and records structures, a data lake uses a less restricted schema-on-read-based
architecture to store data. Each data element in the data lake is assigned a distinctive identifier and
tagged with a set of comprehensive metadata tags. A data lake is typically deployed using distributed
data object storage, to enable the schema-on-read structure. This means that business analytics and
data mining tools access the data without a complex schema. Using a schema-on-read methodology
enables you to load your data as is and start to get value from it instantaneously. For deployment onto
the cloud, it is a cost-effective solution to use Amazon’s Simple Storage Service (Amazon S3) to store
the base data for the data lake.
Data Vault
Data vault modeling, designed by Dan Linstedt, is a database modeling method that is intentionally
structured to be in control of long-term historical storage of data from multiple operational systems.
The data vaulting processes transform the schema-on-read data lake into a schema-on-write data vault.
The data vault is designed into the schema-on-read query request and then executed against the data
lake. The structure is built from three basic data structures: hubs, links, and satellites
Hubs Hubs contain a list of unique business keys with low propensity to change. They contain a
surrogate key for each hub item and metadata classification of the origin of the business key. The hub
is the core backbone of your data vault.
Links Associations or transactions between business keys are modeled using link tables. These tables
are essentially many-to-many join tables, with specific additional metadata. The link is a singular relationship between hubs that ensures the business relationships are accurately recorded to complete the data model for the real-life business.
Satellites Hubs and links form the structure of the model but store no chronological characteristics or
descriptive characteristics of the data. These characteristics are stored in appropriated tables identified
as satellites. Satellites are the structures that store comprehensive levels of the information on business
characteristics and are normally the largest volume of the complete data vault data structure. The
appropriate combination of hubs, links, and satellites helps the data scientist to construct and store
prerequisite business relationships. This is a highly in-demand skill for a data modeler.
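The following is a minimal illustrative sketch in Python (using the standard sqlite3 module) of how hubs, links, and satellites could be laid out as tables. The table and column names are hypothetical, and this is a sketch only, not Dan Linstedt's formal specification.
Example in Python:
import sqlite3

# Illustrative only: hypothetical hub, link, and satellite tables for a
# customer/order vault.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Hub: unique business keys with a surrogate key and load metadata.
cur.execute("""CREATE TABLE hub_customer (
    customer_hk   TEXT PRIMARY KEY,   -- surrogate (hash) key
    customer_id   TEXT UNIQUE,        -- business key
    load_date     TEXT,
    record_source TEXT)""")

# Link: many-to-many association between business keys (hubs).
cur.execute("""CREATE TABLE link_customer_order (
    link_hk       TEXT PRIMARY KEY,
    customer_hk   TEXT REFERENCES hub_customer(customer_hk),
    order_hk      TEXT,               -- would reference a hub_order table
    load_date     TEXT,
    record_source TEXT)""")

# Satellite: descriptive, history-tracked characteristics of a hub.
cur.execute("""CREATE TABLE sat_customer_details (
    customer_hk   TEXT REFERENCES hub_customer(customer_hk),
    load_date     TEXT,
    name          TEXT,
    city          TEXT,
    PRIMARY KEY (customer_hk, load_date))""")

conn.commit()
conn.close()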
Data Warehouse Bus Matrix
The Enterprise Bus Matrix is a data warehouse planning tool and model created by Ralph Kimball and
used by numerous people worldwide over the last 40+ years. The bus matrix and architecture builds
upon the concept of conformed dimensions that are interlinked by facts. The data warehouse is a major
component of the solution required to transform data into actionable knowledge. This schema-on-write
methodology supports business intelligence against the actionable knowledge. A data warehouse is a
consolidated, organized and structured repository for storing data.
The next step involves processing tools to transform your data lakes into data vaults and then into data
warehouses. These tools are the workhorses of the data science and engineering ecosystem. Following
are the recommended foundations for the data tools
Spark
Apache Spark is an open source cluster computing framework. Originally developed at the AMP Lab
of the University of California, Berkeley, the Spark code base was donated to the Apache Software
Foundation, which now maintains it as an open source project. Spark offers an interface for
programming distributed clusters with implicit data parallelism and fault tolerance. Spark is a technology that is becoming a de facto standard for numerous enterprise-scale processing applications.
Spark Core
Spark Core is the foundation of the overall development. It provides distributed task dispatching,
scheduling, and basic I/O functionalities. This enables you to offload the comprehensive and complex
running environment to the Spark Core. This ensures that the tasks you submit are accomplished as anticipated.
Spark SQL
Spark SQL is a component on top of the Spark Core that presents a data abstraction called DataFrames. Spark SQL provides a domain-specific language (DSL) to manipulate DataFrames, as illustrated in the sketch below.
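A minimal sketch, assuming the pyspark package is installed and run locally; the data and column names are hypothetical.
Example in Python:
from pyspark.sql import SparkSession

# Start a local Spark session.
spark = SparkSession.builder.appName("SparkSQLExample").getOrCreate()

# Build a small DataFrame and query it with the DSL and with plain SQL.
df = spark.createDataFrame(
    [("Alice", 34), ("Bob", 45), ("Carol", 29)],
    ["name", "age"])

df.filter(df.age > 30).select("name").show()        # DataFrame DSL style

df.createOrReplaceTempView("people")                # SQL style
spark.sql("SELECT name FROM people WHERE age > 30").show()

spark.stop()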
Spark Streaming
Spark Streaming leverages Spark Core’s fast scheduling capability to perform streaming analytics.
Spark Streaming has built-in support to consume from Kafka, Flume, Twitter, ZeroMQ, Kinesis, and
TCP/IP sockets. The process of streaming is the primary technique for importing data from the data
source to the data lake. Streaming is becoming the leading technique to load from multiple data
sources. There are connectors available for many data sources.
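As a minimal sketch, the newer Structured Streaming API (built on Spark SQL) can consume a TCP/IP socket stream as follows; the host and port are placeholders.
Example in Python:
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("SocketStream").getOrCreate()

# Read a text stream from a TCP/IP socket (for testing: nc -lk 9999).
lines = (spark.readStream
              .format("socket")
              .option("host", "localhost")
              .option("port", 9999)
              .load())

# Write each micro-batch to the console until the query is stopped.
query = lines.writeStream.outputMode("append").format("console").start()
query.awaitTermination()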
MLlib
Spark MLlib (Machine Learning Library) is a distributed machine learning framework used on top of the Spark Core by means of the distributed memory-based Spark architecture. Common machine learning and statistical algorithms have been implemented and are shipped with MLlib, which simplifies large-scale machine learning pipelines, including the following (a k-means sketch follows the list):
• Dimensionality reduction techniques, such as singular value decomposition (SVD) and principal component analysis (PCA)
• Summary statistics, correlations, stratified sampling, hypothesis testing, and random data generation
• Collaborative filtering techniques, including alternating least squares (ALS)
• Classification and regression: support vector machines, logistic regression, linear regression, decision trees, and naive Bayes classification
• Cluster analysis methods, including k-means and latent Dirichlet allocation (LDA)
• Optimization algorithms, such as stochastic gradient descent and limited-memory BFGS (L-BFGS)
• Feature extraction and transformation functions
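A minimal k-means sketch, using the DataFrame-based API in pyspark.ml (part of MLlib); the toy data set is hypothetical.
Example in Python:
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.clustering import KMeans

spark = SparkSession.builder.appName("MLlibKMeans").getOrCreate()

# Hypothetical two-feature data set with two obvious clusters.
df = spark.createDataFrame(
    [(1.0, 1.1), (1.2, 0.9), (9.0, 9.1), (8.8, 9.3)],
    ["x", "y"])

# MLlib estimators expect a single vector column of features.
features = VectorAssembler(inputCols=["x", "y"], outputCol="features").transform(df)

model = KMeans(k=2, seed=42).fit(features)
model.transform(features).select("x", "y", "prediction").show()

spark.stop()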
GraphX
GraphX is a powerful graph-processing application programming interface (API) for the Apache
Spark analytics engine that can draw insights from large data sets. GraphX provides outstanding speed
and capacity for running massively parallel and machine-learning algorithms. The introduction of the
graph-processing capability enables the processing of relationships between data entries with ease.
Mesos
Apache Mesos is an open source cluster manager that was developed at the University of California,
Berkeley. It delivers efficient resource isolation and sharing across distributed applications. The
software enables resource sharing in a fine-grained manner, improving cluster utilization.
Akka
Akka is a toolkit and runtime that shortens the development of large-scale data-centric processing applications. It is an actor-based, message-driven runtime for running concurrency, elasticity, and
resilience processes. The use of high-level abstractions such as actors, streams, and futures facilitates
the data science and engineering granularity processing units. The use of actors enables the data
scientist to spawn a series of concurrent processes by using a simple processing model that employs a
messaging technique and specific predefined actions/behaviors for each actor. This way, the actor can
be controlled and limited to perform the intended tasks only.
Cassandra
Apache Cassandra is a large-scale distributed database supporting multi–data center replication for
availability, durability, and performance.
Kafka
This is a high-scale messaging backbone that enables communication between data processing
entities. The Apache Kafka streaming platform, consisting of Kafka Core, Kafka Streams, and Kafka
Connect, is the foundation of the Confluent Platform. Kafka components empower the capture,
transfer, processing, and storage of data streams in a distributed, fault-tolerant manner throughout an
organization in real time.
Kafka Core At the core of the Confluent Platform is Apache Kafka. Confluent extends that core to
make configuring, deploying, and managing Kafka less complex.
Kafka Streams Kafka Streams is an open source solution that you can integrate into your
application to build and execute powerful stream-processing functions.
Kafka Connect This provides Confluent-tested and secure connectors for numerous standard data
systems. Connectors make it quick and stress-free to start setting up consistent data pipelines. These
connectors are completely integrated with the platform, via the schema registry. Kafka Connect
enables the data processing capabilities that accomplish the movement of data into the core of the data
solution from the edge of the business ecosystem.
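A minimal sketch of producing and consuming a message, assuming the kafka-python client library and a broker running on localhost:9092; the topic name and payload are hypothetical.
Example in Python:
from kafka import KafkaProducer, KafkaConsumer

# Produce a single message onto a hypothetical topic.
producer = KafkaProducer(bootstrap_servers="localhost:9092")
producer.send("sensor-readings", b'{"device": "scanner-01", "count": 42}')
producer.flush()

# Consume messages from the same topic, starting at the earliest offset.
consumer = KafkaConsumer("sensor-readings",
                         bootstrap_servers="localhost:9092",
                         auto_offset_reset="earliest",
                         consumer_timeout_ms=5000)
for message in consumer:
    print(message.value)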
Elasticsearch
Elasticsearch is a distributed, open source search and analytics engine designed for horizontal scalability, reliability, and stress-free management. It combines the speed of search with the power of analytics, via a sophisticated, developer-friendly query language covering structured, unstructured, and time-series data.
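A minimal sketch of indexing and searching a document, assuming the official elasticsearch Python client and a local cluster; exact call signatures vary slightly between client versions, and the index name and document are hypothetical.
Example in Python:
from elasticsearch import Elasticsearch

# Connect to a local cluster.
es = Elasticsearch("http://localhost:9200")

# Index a document, refresh the index, then search it back.
es.index(index="events", id="1", document={"device": "scanner-01", "count": 42})
es.indices.refresh(index="events")

result = es.search(index="events", query={"match": {"device": "scanner-01"}})
for hit in result["hits"]["hits"]:
    print(hit["_source"])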
R
R is a programming language and software environment for statistical computing and graphics. The R
language is widely used by data scientists, statisticians, data miners, and data engineers for developing
statistical software and performing data analysis. The capabilities of R are extended through user-
created packages using specialized statistical techniques and graphical procedures. A core set of
packages is contained within the core installation of R, with additional packages accessible from the
Comprehensive R Archive Network (CRAN). Knowledge of the following packages is a must:
• sqldf (data frames using SQL): This package lets you query and filter data with an SQL statement while reading a file into R. Only the filtered part is processed by R, so files larger than those R can natively import can be used as data sources.
• forecast (forecasting of time series): This package provides forecasting functions for time series and
linear models.
• dplyr (data aggregation): Tools for splitting, applying, and combining data within R
• stringr (string manipulation): Simple, consistent wrappers for common string operations
• RODBC, RSQLite, and RCassandra database connection packages: These are used to connect to
databases, manipulate data outside R, and enable interaction with the source system.
• lubridate (time and date manipulation): Makes dealing with dates easier within R
• ggplot2 (data visualization): Creates elegant data visualizations, using the grammar of graphics. This
is a super-visualization capability.
• reshape2 (data restructuring): Flexibly restructures and aggregates data, using just two functions:
melt and dcast (or acast).
• randomForest (random forest predictive models): random forests for classification and regression
• gbm (generalized boosted regression models)
Scala
Scala is a general-purpose programming language. Scala supports functional programming and a
strong static type system. Many high-performance data science frameworks are constructed using
Scala, because of its amazing concurrency capabilities. Parallelizing masses of processing is a key
requirement for large data sets from a data lake. Scala is emerging as the de facto programming language used by data-processing tools. Scala is also the native language for Spark, and it is useful to master this language.
Python
Python is a high-level, general-purpose programming language created by Guido van Rossum and
released in 1991. It is important to note that it is an interpreted language. Python has a design philosophy that emphasizes code readability. Python uses a dynamic type system and automatic
memory management and supports multiple programming paradigms (object-oriented, imperative,
functional programming, and procedural). Thanks to its worldwide success, it has a large and
comprehensive standard library.
MQTT (MQ Telemetry Transport)
MQTT stands for MQ Telemetry Transport. It is an extremely simple and lightweight publish-and-subscribe messaging protocol. It was intended for constrained devices and low-bandwidth, high-latency, or unreliable networks. This protocol is perfect for machine-to-machine (M2M) or
Internet-of-things-connected devices. MQTT-enabled devices include handheld scanners, advertising
boards, footfall counters, and other machines. The apt use of this protocol is critical in the present and
future data science.
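A minimal publish-and-subscribe sketch, assuming the paho-mqtt client library (1.x constructor style; version 2.x adds a callback-API argument) and a local broker such as Mosquitto; the topic name is hypothetical.
Example in Python:
import time
import paho.mqtt.client as mqtt

def on_message(client, userdata, message):
    # Called for every message received on a subscribed topic.
    print(message.topic, message.payload.decode())

client = mqtt.Client()                      # 1.x constructor; 2.x adds a callback-API argument
client.on_message = on_message
client.connect("localhost", 1883)           # assumes a local broker such as Mosquitto
client.subscribe("sensors/footfall")
client.loop_start()

client.publish("sensors/footfall", "42")    # lightweight publish on the same connection
time.sleep(1)                               # give the subscriber a moment to receive it
client.loop_stop()
client.disconnect()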
Layered Framework
Definition of Data Science Framework
Data science is a series of discoveries that converts raw unstructured data from your data lake into actionable business data. This process is a cycle of discovering and evolving your understanding of the data you are working with, to supply you with the metadata that you need. We need to build a basic
framework that is used for data processing. This will enable you to construct a data science solution
and then easily transfer it to your data engineering environments. The following framework works for projects ranging from small departmental efforts to at-scale, internationally distributed deployments, as the framework has a series of layers that enable you to follow a logical building process and then use your data processing and discoveries across many projects.
Cross-Industry Standard Process for Data Mining (CRISP-DM)
Write short note on CRISP-DM
CRISP-DM was conceived in 1996 and, by 1997, was extended via a European Union project under the ESPRIT funding initiative. It was the methodology with the broadest support among data scientists until mid-2015. The web site that was driving the Special Interest Group disappeared on June 30, 2015, and has since reopened. Since then, however, the methodology has been losing ground to custom modeling methodologies. The basic concept behind the process is still valid, but you will find that most companies do not use it as is in their projects and employ some form of modification as an internal standard.
Business Understanding
This initial phase concentrates on discovery of the data science goals and requests from a business perspective.
Data Understanding
The data understanding phase starts with an initial data collection and continues
with actions to discover the characteristics of the data. This phase identifies data quality complications
and insights into the data.
Data Preparation
The data preparation phase covers all activities to construct the final data set
for modeling tools. This phase is used in a cyclical order with the modeling phase, to achieve a
complete model. This ensures you have all the data required for your data science.
Modeling
In this phase, different data science modeling techniques are nominated and
evaluated for accomplishing the prerequisite outcomes, as per the business requirements. It returns to
the data preparation phase in a cyclical order until the processes achieve success.
Evaluation
At this stage, the process should deliver high-quality data science. Before
proceeding to final deployment of the data science, validation that the proposed data science solution
achieves the business objectives is required. If this fails, the process returns to the data understanding
phase, to improve the delivery.
Deployment
Creation of the data science is generally not the end of the project. Once the data science is past the development and pilot phases, it has to go to production. Processing capacity can also be scaled dynamically to match demand. For example, for end-of-month processing, you increase your processing capacity to sixty nodes, to handle the extra demand of the end-of-month run. The rest of the month, you run at twenty nodes during business hours. During weekends and other slow times, you only run with five nodes. Massive savings can be generated in this manner.
Control
The control sublayer controls the execution of the current active data science processes in a production ecosystem. The control elements are a combination of the control elements within the Data Science Technology Stack’s individual tools plus a custom interface to control the
primary workflow. The control also ensures that when processing experiences an error, it can attempt a
recovery, as per your requirements, or schedule a clean-up utility to undo the error. This enables data
scientists to concentrate on the models and processing and not on complying with the more controlled
production requirements.
The Basics for Functional Layer
The functional layer of the data science ecosystem is the main layer of programming required. The functional layer is the part of the ecosystem that executes the
comprehensive data science. It consists of several structures.
• Data models
• Processing algorithms
• Provisioning of infrastructure
The processing algorithms are spread across six supersteps of processing, as follows:
1. Retrieve: This superstep contains all the processing chains for retrieving data from the raw data lake into a more structured format.
2. Assess: This superstep contains all the processing chains for quality assurance and additional data
enhancements.
3. Process: This superstep contains all the processing chains for building the data vault.
4. Transform: This superstep contains all the processing chains for building the data warehouse.
5. Organize: This superstep contains all the processing chains for building the data marts.
6. Report: This superstep contains all the processing chains for building virtualization and reporting
the actionable knowledge.
Business Layer
The business layer is the transition point between the nontechnical business requirements and desires
and the practical data science, where, I suspect, most readers of this book will have a tendency to want
to spend their careers, doing the perceived more interesting data science. The business layer does not
belong to the data scientist 100%, and normally, its success represents a joint effort among such
professionals as business subject matter experts, business analysts, hardware architects, and data
scientists. The business layer is where we record the interactions with the business. This is where we
convert business requirements into data science requirements
The Functional Requirements
Describe the functional requirements in the business layer of the data science framework
Functional requirements record the detailed criteria that must be followed to realize the business’s
aspirations from its real-world environment when interacting with the data science ecosystem. These
requirements are the business’s view of the system, which can also be described as the “Will of the
Business.” The MoSCoW method is a prioritization technique used to indicate how important each requirement is to the business: Must have, Should have, Could have, and Won’t have (this time).
Dimensions
A dimension is a structure that categorizes facts and measures, to enable you to respond to business
questions. A slowly changing dimension is a data structure that stores the complete history of the data
loads in the dimension structure over the life cycle of the data lake. There are several types of Slowly
Changing Dimensions (SCDs) in the data warehousing design toolkit that enable different recording
rules of the history of the dimension.
Facts
A fact is a measurement that symbolizes a fact about the managed entity in the real world.
Availability
Identify single points of failure (SPOFs) in the data science solution. Ensure that you record these clearly, as SPOFs can indirectly impact many of your availability requirements. Dependencies between components that may not be available at the same time must be recorded, and requirements must be specified, to reflect this availability requirement fully.
Backup Requirements
A backup, or the process of backing up, refers to the archiving of the data lake and all the data science
programming code, programming libraries, algorithms, and data models, with the sole purpose of
restoring these to a known good state of the system, after a data loss or corruption event. Remember:
Even with the best distribution and self-healing capability of the data lake, you have to ensure that you
have a regular and appropriate backup to restore. A backup is only valid if you can restore
it. The merit of any system is its ability to return to a good state. This is a critical requirement. For
example, suppose that your data scientist modifies the system with a new algorithm that erroneously
updates an unknown amount of the data in the data lake.
Capacity, Current, and Forecast
Capacity is the ability to load, process, and store a specific quantity of data by the data science
processing solution. You must track the current and forecast the future requirements, because as a data
scientist, you will design and deploy many complex models that will require additional capacity to
complete the processing pipelines you create during your processing cycles.
Capacity
Capacity is measured per the component’s ability to consistently maintain specific levels of
performance as data load demands vary in the solution. The correct way to record the requirement is
Component C will provide P% capacity for U users, each with M MB of data during a time frame of T
seconds.
Example:
The data hard drive will provide 95% capacity for 1000 users, each with 10MB of data during a time
frame of 10 minutes.
Concurrency
Concurrency is the measure of a component to maintain a specific level of performance under multiple simultaneous load conditions.
The correct way to record the requirement is: Component C will support a concurrent group of U users running predefined acceptance script S simultaneously.
Throughput Capacity
This is how many transactions the system is required to handle at peak times under specific conditions.
Storage (Memory)
This is the volume of data the system will persist in memory at runtime to sustain an effective
processing solution.
Storage (Disk)
This is the volume of data the system stores on disk to sustain an effective processing solution.
Storage (GPU)
This is the volume of data the system will persist in GPU memory at runtime to sustain an effective
parallel processing solution, using the graphical processing capacity of the solution. A CPU consists of a limited number of cores that are optimized for sequential serial processing, while a GPU has a massively parallel architecture consisting of thousands of smaller, more efficient cores intended for handling many tasks simultaneously.
Configuration Management
Configuration management (CM) is a systems engineering process for establishing and maintaining
consistency of a product’s performance, functional, and physical attributes against requirements,
design, and operational information throughout its life.
Deployment
A methodical procedure of introducing data science to all areas of an organization is required.
Investigate how to achieve a practical continuous deployment of the data science models. These skills
are much in demand, as the processes model changes more frequently as the business adopts new
processing techniques.
Documentation
Data science requires a set of documentation to support the story behind the algorithms.
Disaster Recovery
Disaster recovery (DR) involves a set of policies and procedures to enable the recovery or continuation
of vital technology infrastructure and systems following a natural or human-induced disaster.
Efficiency (Resource Consumption for Given Load)
Efficiency is the ability to accomplish a job with a minimum expenditure of time and effort. As a data
scientist, you are required to understand the efficiency curve of each of your modeling techniques and
algorithms.
Effectiveness (Resulting Performance in Relation to Effort)
Effectiveness is the ability to accomplish a purpose, producing the precise intended or expected result
from the ecosystem. As a data scientist, you are required to understand the effectiveness curve of each
of your modeling techniques and algorithms. You must ensure that the process is performing only the
desired processing and has no negative side effects.
Extensibility
Extensibility is the ability to add extra features and carry forward customizations at next-version upgrades within the data science ecosystem. The data science must always be capable of being extended to support new requirements.
Failure Management
Failure management is the ability to identify the root cause of a failure and then successfully record all
the relevant details for future analysis and reporting.
Fault Tolerance
Fault tolerance is the ability of the data science ecosystem to handle faults in the system’s processing.
In simple terms, no single event must be able to stop the ecosystem from continuing the data science
processing.
Latency
Latency is the time it takes to get the data from one part of the system to another. This is highly relevant in the distributed environment of data science ecosystems. The correct way to record the requirement is: Acceptance script S completes within T seconds on an unloaded system and within T2 seconds on a system running at maximum capacity, as defined in the concurrency requirement.
Interoperability
Insist on a precise ability to share data between different computer systems under this section. Explain in detail which systems must interact with which other systems.
Maintainability
Insist on a precise period during which a specific component is kept in a specified state. Describe
precisely how changes to functionalities, repairs, and enhancements are applied while keeping the
ecosystem in a known good state.
Modifiability
Stipulate the exact amount of change the ecosystem must support for each layer of the solution.
Network Topology
Stipulate and describe the detailed network communication requirements within the ecosystem for
processing. Also, state the expected communication to the outside world, to drive successful data
science
Privacy
List the exact privacy laws and regulations that apply to this ecosystem. Make sure you record the
specific laws and regulations that apply. Seek legal advice if you are unsure. This is a hot topic
worldwide, as you will process and store other people’s data and execute algorithms against this data.
As a data scientist, you are responsible for your actions.
Quality
Rigorously specify the faults discovered, faults delivered, and fault-removal efficiency at all levels of the
ecosystem. Remember: Data quality is a functional requirement. This is a nonfunctional requirement
that states the quality of the ecosystem, not the data flowing through it.
Recovery/Recoverability
The ecosystem must have a clear-cut mean time to recovery (MTTR) specified. The MTTR for
specific layers and components in the ecosystem must be separately specified. I typically measure in hours, but for extra-complex systems, I measure in minutes or even seconds.
Reliability
The ecosystem must have a precise mean time between failures (MTBF). This measurement of
availability is specified in a pre-agreed unit of time. I normally measure in hours, but there are extra
sensitive systems that are best measured in years.
Resilience
Resilience is the capability to deliver and preserve a tolerable level of service when faults and issues with normal operations generate complications for the processing. The ecosystem must have a defined
ability to return to the original form and position in time, regardless of the issues it has to deal with
during processing.
Resource Constraints
Resource constraints are the physical requirements of all the components of the ecosystem. The areas
of interest are processor speed, memory, disk space, and network bandwidth, plus, normally, several
other factors specified by the tools that you deploy into the ecosystem
Reusability
Reusability is the use of pre-built processing solutions in the data science ecosystem development
process. The reuse of preapproved processing modules and algorithms is highly advised in the general
processing of data for the data scientists.
Scalability
Scalability is how you get the data science ecosystem to adapt to your requirements. I use three
scalability models in my ecosystem: horizontal, vertical, and dynamic (on-demand). Horizontal
scalability increases capacity in the data science ecosystem through more separate resources, to
improve performance and provide high availability (HA). Vertical scalability increases capacity by
adding more resources (more memory or an additional CPU) to an individual machine.
Security
One of the most important nonfunctional requirements is security. I specify security requirements at
three levels: Privacy; Physical (include physical requirements such as power, elevated floors, extra server cooling, fire-prevention systems, and cabinet locks); and Access (specify detailed access requirements, with defined account types/groups and their precise access rights).
Testability
International standard IEEE 1233-1998 states that testability is the “degree to which a requirement is
stated in terms that permit establishment of test criteria and performance of tests to determine whether
those criteria have been met.” In simple terms, if your requirements are not testable, do not accept
them.
Controllability
Knowing the precise degree to which I can control the state of the code under test, as required for
testing, is essential. The algorithms used by data science are not always controllable, as they include
random start points to speed the process.
Isolate Ability
The specific degree to which I can isolate the code under test will drive most of the possible testing.
Understandability
The degree to which the algorithms under test are documented directly impacts the testability of
requirements.
Automatability
The degree to which I can automate testing of the code
What are the common pitfalls in requirements?
Common Pitfalls with Requirements
1) Weak Words
Weak words are subjective or lack a common or precise definition. The following are examples in
which weak words are included and identified:
• Users must easily access the system.
What is “easily”?
• Use reliable technology.
What is “reliable”?
• State-of-the-art equipment
What is “state-of-the-art”?
• Reports must run frequently.
What is “frequently”?
2) Unbounded Lists
An unbounded list is an incomplete list of items. Make sure your lists are complete and precise.
3) Implicit Collections
When collections of objects within requirements are not explicitly defined, you or your team will
assume an incorrect meaning. See the following example:
The solution must support TCP/IP and other network protocols supported by existing users with Linux.
• What is meant by “existing user”?
4) Ambiguity
Ambiguity occurs when a word within the requirement has multiple meanings.
5) Vagueness (e.g., “current standards”)
6) Subjectivity (e.g., “easily”)
7) Optionality (e.g., “as many as possible”)
8) Under-specification (e.g., “other database versions”)
9) Under-reference (e.g., “previous reports”)
Create a traceability matrix against each requirement and the data science process you developed, to
ensure that you know what data science process supports which requirement. This ensures that you
have complete control of the environment. Changes are easy if you know how everything
interconnects.
Utility Layer
Explain utility layer
The utility layer is used to store repeatable practical methods of data science. Utilities are the common
and verified workhorses of the data science ecosystem. The utility layer is a central storehouse for
keeping all your solution utilities in one place. Having a central store for all utilities ensures that you do not use out-of-date or duplicate algorithms in your solutions. The most important benefit is that you can use stable algorithms across your solutions. If you use algorithms, keep any proof and credentials that show that the process is a high-quality, industry-accepted algorithm. The additional value is the capability of larger teams to work on the same project and know that each data scientist or engineer is working to identical standards. The European Union General Data Protection Regulation (GDPR) is in effect. The GDPR has the following rules:
• You must have valid consent as a legal basis for processing.
• You must assure transparency, with clear information about what data is collected and how it is processed. Utilities must generate complete audit trails of all their activities.
• You must support the right to accurate personal data. Utilities must use only the latest accurate data.
• You must support the right to have personal data erased. Utilities must support the removal of all information on a specific person.
• You must have approval to move data between service providers.
• You must support the right not to be subject to a decision based solely on automated processing.
Basic Utility Design
The basic utility must have a common layout to enable future reuse and enhancements. This
standard makes the utilities more flexible and effective to deploy in a large-scale ecosystem.
The basic design for a processing utility is a three-stage process:
1. Load data as per input agreement.
2. Apply processing rules of utility.
3. Save data as per output agreement.
The main advantage of this methodology in the data science ecosystem is that you can build a rich set
of utilities that all your data science algorithms require. That way, you have a basic pre-validated set of
tools to use to perform the common processing and then spend time only on the custom portions of the
project.
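A minimal sketch of the three-stage pattern in Python; the function, file names, and rule are hypothetical and assume CSV input and output agreements.
Example in Python:
import pandas as pd

def run_utility(input_path, output_path, rule):
    # 1. Load data as per the input agreement.
    data = pd.read_csv(input_path)
    # 2. Apply the processing rule of the utility.
    data = rule(data)
    # 3. Save data as per the output agreement.
    data.to_csv(output_path, index=False)

# Example rule: remove duplicate rows.
# run_utility("input.csv", "output.csv", lambda df: df.drop_duplicates())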
There are three types of utilities:
• Data processing utilities
• Maintenance utilities
• Processing utilities
Data processing utilities perform some form of data transformation within the solutions.
Describe various data processing utilities
Retrieve Utilities
Utilities for this superstep contain the processing chains for retrieving data out of the raw data lake
into a new structured format. You can build all your retrieve utilities to transform the external raw data
lake format into the Homogeneous Ontology for Recursive Uniform Schema (HORUS) data format.
For example, the HORUS format can be selected to be CSV-based.
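A hypothetical retrieve utility sketch in Python, assuming a CSV-based HORUS format as suggested above; the file paths and raw JSON source are placeholders.
Example in Python:
import pandas as pd

def retrieve_to_horus(raw_json_path, horus_csv_path):
    # Load semi-structured raw data from the data lake (JSON in this sketch).
    raw = pd.read_json(raw_json_path)
    # Persist it in the structured HORUS (CSV) format for the next superstep.
    raw.to_csv(horus_csv_path, index=False)

# retrieve_to_horus("datalake/customers.json", "horus/customers.csv")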
Assess Utilities
The assess utilities ensure that the data imported via the Retrieve superstep is of good quality and conforms to the prerequisite standards of your solution. There are two types of assess
utilities:
Feature Engineering
Feature engineering is the process by which you enhance or extract data sources, to enable better
extraction of characteristics you are investigating in the data sets.
Fixers Utilities
Fixers enable your solution to take your existing data and fix a specific quality issue.
Examples include
• Removing leading or trailing spaces from a data entry
Example in Python:
baddata = " Data Science with too many spaces is bad!!! "
print('>',baddata,'<')
cleandata=baddata.strip()
• Removing nonprintable characters from a data entry
• Reformatting a data entry to match specific formatting criteria, for example, converting 2017/01/31 to 31 January 2017 (both fixes are illustrated in the sketch below)
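A minimal sketch of these two fixes in Python; the dirty string is a contrived example.
Example in Python:
import string
from datetime import datetime

# Remove nonprintable characters from a data entry.
dirty = "Bad\x00data\x07entry"
printable_only = ''.join(ch for ch in dirty if ch in string.printable)

# Reformat 2017/01/31 to 31 January 2017.
formatted = datetime.strptime("2017/01/31", "%Y/%m/%d").strftime("%d %B %Y")
print(printable_only, formatted)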
Adders Utilities
Adders use existing data entries and then add additional data entries to enhance your data
Process Utilities
Utilities for this superstep contain all the processing chains for building the data vault.
Data Vault Utilities
The data vault is a highly specialist data storage technique that was designed by Dan Linstedt. The
data vault is a detail-oriented, historical-tracking, and uniquely linked set of normalized tables that
support one or more functional areas of business. It is a hybrid approach encompassing the best of
breed between 3rd normal form (3NF) and star schema.
Transform Utilities
Utilities for this superstep contain all the processing chains for building the data warehouse from the
results of your practical data science. There are two basic transform utilities.
Dimensions Utilities
The dimensions use several utilities to ensure the integrity of the dimension structure.
Data Science Utilities
There are several data science–specific utilities that are required for you to achieve success in the data
processing ecosystem.
Data Binning or Bucketing
Binning is a data preprocessing technique used to reduce the effects of minor observation errors.
Statistical data binning is a way to group a number of more or less continuous values into a smaller
number of “bins.”
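A minimal binning sketch using pandas; the values and bin edges are hypothetical.
Example in Python:
import pandas as pd

ages = pd.Series([22, 25, 31, 38, 44, 52, 61, 67])

# Group the continuous ages into a smaller number of bins.
binned = pd.cut(ages, bins=[0, 30, 50, 70], labels=["young", "middle", "senior"])
print(binned.value_counts())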
Averaging of Data
The use of averaging of feature values enables the reduction of data volumes in a controlled fashion, to improve effective data processing. This technique also helps the data science prevent a common issue called overfitting the model.
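A minimal averaging sketch using pandas, reducing per-reading data to one averaged value per device; the column names are hypothetical.
Example in Python:
import pandas as pd

readings = pd.DataFrame({
    "device": ["A", "A", "B", "B", "B"],
    "value":  [10.0, 12.0, 7.0, 9.0, 8.0]})

# Average the feature values per device to reduce the data volume.
averaged = readings.groupby("device", as_index=False)["value"].mean()
print(averaged)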
Outlier Detection
Outliers are data points that are so different from the rest of the data in the data set that they may be caused by an error in the data source. Outlier detection is a technique that, with good data science, will identify these outliers.
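A minimal outlier-detection sketch using the common 1.5 × IQR rule in pandas; the values and the threshold are illustrative assumptions, not the only possible rule.
Example in Python:
import pandas as pd

values = pd.Series([9, 10, 10, 11, 11, 12, 250])   # 250 looks suspicious

# Flag values that fall outside 1.5 * IQR of the middle 50% of the data.
q1, q3 = values.quantile(0.25), values.quantile(0.75)
iqr = q3 - q1
outliers = values[(values < q1 - 1.5 * iqr) | (values > q3 + 1.5 * iqr)]
print(outliers)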
Organize Utilities
Utilities for this superstep contain all the processing chains for building the data marts. The organize
utilities are mostly used to create data marts against the data science results stored in the data
warehouse dimensions and facts
Report Utilities
Utilities for this superstep contain all the processing chains for building virtualization and reporting of
the actionable knowledge. The report utilities are mostly used to create data virtualization against the
data science results stored in the data marts.
What are the maintenance and processing utilities in the Layered Framework utility layer?
Maintenance Utilities
The data science solutions you are building are standard data systems and, consequently, require maintenance utilities, as with any other system.
• Backup and Restore Utilities
These perform different types of database backups and restores for the solution.
• Checks Data Integrity Utilities
These utilities check the allocation and structural integrity of database objects and indexes across the
ecosystem, to ensure the accurate processing of the data into knowledge.
• History Cleanup Utilities
These utilities archive and remove entries in the history tables in the databases.
• Maintenance Cleanup Utilities
These utilities remove artifacts related to maintenance plans and database backup files.
• Notify Operator Utilities
Utilities that send notification messages to the operations team about the status of the system are
crucial to any data science factory
• Rebuild Data Structure Utilities
These utilities rebuild database tables and views to ensure that all the development is as designed
• Reorganize Indexing Utilities
These utilities reorganize indexes in database tables and views, which is a major operational process
when your data lake grows at a massive volume and velocity. The variety of data types also
complicates the application of indexes to complex data structures.
• Shrink/Move Data Structure Utilities
These reduce the footprint size of your database data and associated log artifacts, to ensure an
optimum solution is executing
• Solution Statistics Utilities
These utilities update information about the data science artifacts, to ensure that your data science
structures are recorded. Call it data science on your data science
Processing Utilities
The data science solutions you are building require processing utilities to perform standard system
processing. The data science environment requires two basic processing utility types.
• Scheduling Utilities
The scheduling utilities are based on basic agile scheduling principles and include the backlog, to-do, doing, and done utilities that follow.
• Backlog Utilities
Backlog utilities accept new processing requests into the system; these requests are then ready to be processed in future processing cycles.
• To-Do Utilities
The to-do utilities take a subset of backlog requests for processing during the next processing cycle.
They use classification labels, such as priority and parent-child relationships, to decide what process
runs during the next cycle.
• Doing Utilities
The doing utilities execute the current cycle’s requests.
• Done Utilities
The done utilities confirm that the completed requests performed the expected processing.
• Monitoring Utilities
The monitoring utilities ensure that the complete system is working as expected.