0% found this document useful (0 votes)
3 views

Data_science_note_

The document outlines a course on Data Science, detailing its structure, evaluation methods, and key terminologies. It covers the definition of data, the data science life cycle, modern data ecosystems, and various tools and techniques used in data analysis. Additionally, it emphasizes the importance of transforming data into actionable insights for informed decision-making.

Uploaded by

lanusha820
Copyright
© © All Rights Reserved
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
3 views

Data_science_note_

The document outlines a course on Data Science, detailing its structure, evaluation methods, and key terminologies. It covers the definition of data, the data science life cycle, modern data ecosystems, and various tools and techniques used in data analysis. Additionally, it emphasizes the importance of transforming data into actionable insights for informed decision-making.

Uploaded by

lanusha820
Copyright
© © All Rights Reserved
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 91

Foundation of Data Science

Course Code: CT-653


(Module#1)
Dhawa Sang Dong, MSc Eng.
(Lecturer)

Kathford International College of


Engineering and Management
Balkumari, Lalitpur
November 2024
Chapter#1
Introduction to Data Science

✓ Class Outline
1 Introduction to Data Science
2 Terminologies in Data Science
3 Modern Data Ecosystem
4 Data Science Life Cycle
5 Trends, markets and applications of data science
6 Tools and Technologies in Data Science
7 Data Scientist and their Roles
Course Evaluation
Theory (100)
I Internal weight (40/100)
- Assignments (in total 7) for each module [ 3 x 7 = 21 ]
- Class Activities and average of 2-Test [ 4 + (5 + 10) = 19 ]
II External weight (60/100)
- End Semester Exam by IOE, TU

Practical (50)
- Lab Attendance (0.5 per Lab) and Viva [ 5 + 1.5 x 10 = 20 ]
- Lab Report and Minor Data Science Project [ 10 + 20 = 30 ]

Reference Books
1 Principle of Data Science – Sinan Ozdemir, 2016 (Packt
Publishing)
2 Data Science from Scratch: First Principles with Python, Joel
Grus, 2017 (O’Reilly Media)
3 Introduction to Data Science – Laura Igual, 2017 (Springer)
Introduction to Data Science

What is Data? | Data Science


➤ Data is simply collection of information in either an
organized format or in unorganized format.

➠ Organized Data: this refers to data that is sorted into a


row/columns structure;
- every row represents a single observation and the columns
represents the characteristics of that observation.

➠ Unorganized Data: this is the type of data that is in the


free form, usually text or raw audio/signals that must be
parsed further to become organized.
Introduction to Data Science
What is Data Science?
➠ Data Science is all about how we take data, use it to acquire
knowledge, and then use that knowledge to do the following:
✔ make decisions
✔ predict the future
✔ understand the past/present
✔ create new products/industries

➠ Fundamentally, data science includes Math/Statistics,


Computer Programming, and Domain Knowledge
Data Model refers to an organized and formal relationship
between elements of data, usually meant to simulate a
real-world phenomenon;
Usually, math is used to formalize relationship between
variables.
Introduction to Data Science

What is Data Science?


✔ Data science is the field of exploring, manipulating, and
analyzing data, and use knowledge/insights to answer questions
or make recommendations or decision making.
✔ Data Science is the study of handling and extracting
meaningful insights from large data sets using modern tools
and algorithms.
✔ The meaningful insights drawn from the data help us in
decision-making.
Introduction to Data Science
What is Data Science?

- Data alone holds limited value unless it is transformed into


actionable insights – essential for informed decision-making and
improving processes like design and manufacturing.
- Data Science Tools and Algorithms enable various data mining
tasks (✔descriptive, ✔predictive, ✔diagnostic, and
✔prescriptive analytics) providing insights within the data (past
and present) trends, forecast future outcomes, identify root
causes, and recommend potential actions.
Introduction to Data Science
What is Data Science?
➠ Data science combines ✔math and statistics,
✔specialized programming, ✔advanced analytics,
✔artificial intelligence (AI) and ✔machine learning with
specific ✔subject matter expertise to uncover actionable
insights hidden in an organization’s data.

➠ These insights can be used to guide decision making and


strategic planning.
Terminologies in Data Science

Terminologies in Data Science


Here, we will discuss some key words/terminologies/jargon related
to field of data science – a data analyst or data scientist may use

Algorithm: | Terminologies in Data Science


- An algorithm is a sequence of steps or guidelines designed to
accomplish a particular task.
- They are especially valuable when dealing with big data or
machine learning.
- Data analysts often use algorithms to structure or examine
data, while data scientists employ them to make forecasts or
create models.
Terminologies in Data Science

Artificial Intelligence (AI): | Terminologies in Data Science


¬ “Artificial Intelligence is the science and engineering of making
machine intelligent.” – John McCarthy, Father of AI
(Darthmouth Conference 1956)

- Artificial intelligence (AI) uses algorithms and vast datasets


from computer science to enable machines to perform tasks
that typically require human intelligence, such as recognizing
patterns, making decisions, and solving complex problems.

- The intelligence is considered "artificial" because the computer


is programmed (explicitly) to carry out tasks typically linked to
human cognitive functions.
Terminologies in Data Science

Big Data: | Terminologies in Data Science


¬ Big data is a vast set of information defined by the three V’s:
volume, velocity, and variety.

➠ Volume relates to the high amount of data – big data involves


handling large quantities;
➠ velocity refers to the speed at which data is generated and
gathered – big data is collected rapidly, often streaming directly
into memory; and
➠ variety highlights the diversity of data types – big data
encompasses a wide range of structured, semi-structured, and
unstructured data, as well as various formats like numbers, text,
images, and audio.
Terminologies in Data Science

Business intelligence (BI): | Terminologies in Data Science


➤ Business intelligence (BI) involves use of data analytics to
help organizations make informed, data-driven decisions.
➤ BI analysts examine business data such as revenue, sales, or
customer information and provide recommendations based on
their findings.
➤ By leveraging BI tools and techniques, businesses can identify
trends, uncover insights, and optimize operations.
Terminologies in Data Science

Changelog: | Terminologies in Data Science


¬ A changelog is a record or log of all the changes made to a
project, software, or system over the time.

➠ It typically includes details about new features, bug fixes,


improvements, updates, or other modifications in each version
or release.

➠ Changelogs are useful for developers, users, and stakeholders


to track the evolution of a project and understand what has
been altered between different versions.
Terminologies in Data Science

Classification: | Terminologies in Data Science


¬ Classification is a type of machine learning task that sorts data
into predefined categories.
➠ It can be applied, for instance, to develop email spam filters.
➠ Common algorithms used to build classification models
include logistic regression, decision trees, K-nearest neighbors
(KNN), and random forests.

Dashboard: | Terminologies in Data Science


¬ A dashboard is a tool used to monitor and display live data.
➠ dashboards are typically connected to databases and feature
visualizations that automatically update to reflect the most
current data in the database.
Terminologies in Data Science

Data Analytics: | Terminologies in Data Science


✔ Data analytics involves gathering, transforming, and organizing
data to draw insights, make predictions, and support informed
decision-making.

➠ It includes data analysis (extracting meaningful information


from data), data science (using data to hypothesize and predict),
and data engineering (developing data systems).

➠ Professionals in this field include data analysts, data scientists,


and data engineers, all contributing to different aspects of data
analytics to draw insights from the data.
Jargon of Data Science

Data Analytics: | Terminologies in Data Science


¬ There are four key types of data analytics, including:

✔ Descriptive analytics, ➠ what happened.

✔ Diagnostic analytics, ➠ why something happened.

✔ Predictive analytics, ➠ what will likely happen in the future.

✔ Prescriptive analytics, ➠ how to act


Terminologies in Data Science

Data Architecture: | Terminologies in Data Science


➠ Data architecture, or data design, is the strategic framework
for an organization’s data management system.

➠ It covers every stage of the data lifecycle, including data


collection, organization, usage, and disposal.

➠ Data architects are responsible for designing the plans that


guide how these systems are built and maintained.
Terminologies in Data Science

Data Cleaning: | Terminologies in Data Science


➠ Data cleaning, also known as data cleansing or scrubbing, is
the process of preparing raw data for analysis.

➠ This involves ensuring the data is accurate, complete,


consistent, and free of bias.

➠ Clean data is essential before performing any analysis, as


unclean or flawed data can result in incorrect conclusions and
poor business decisions.
Terminologies in Data Science

Data Engineering: | Terminologies in Data Science


➠ Data engineering involves creating systems that make data
accessible for analysis.

➠ Data engineers are responsible for building systems that


gather, manage, and transform raw data into usable insights.

➠ Their tasks often include developing algorithms to process data


into a more practical format, constructing database pipelines,
and designing new tools for data analysis.
Terminologies in Data Science

Data Enrichment: | Terminologies in Data Science


➠ Data enrichment involves augmenting your existing dataset
with additional information.
➠ This process usually occurs during data transformation as you
prepare for analysis, particularly if you identify the need for
more data to effectively address your business questions.

Data Governance: | Terminologies in Data Science


➠ Data governance refers to the structured framework for how an
organization oversees its data management.
➠ It includes guidelines for data access and usage, as well as
rules related to accountability and compliance.
Terminologies in Data Science

Data Lake: | Terminologies in Data Science


➠ A data lake is a storage repository designed to collect and
retain vast amounts of structured, semi-structured, and
unstructured raw data.
➠ Data scientists utilize the data stored in data lakes for machine
learning or AI algorithms and models, or they may process the
data and transfer it to a data warehouse.

Data Mart: | Terminologies in Data Science


➠ A data mart is a smaller segment of a data warehouse that
contains all processed data relevant to a specific department.
➠ While a data warehouse might encompass information related
to finance, marketing, sales, and human resources, a data mart
focuses specifically on data pertinent to the finance team.
Terminologies in Data Science
Data Mining: | Terminologies in Data Science
➠ Data mining involves thoroughly analyzing data to uncover
patterns and extract insights.
➠ It is a key component of data analytics, as the insights gained
during the mining process will guide your business decision or
recommendations.

Data Modeling: | Terminologies in Data Science


➠ Data modeling is the process of creating maps and
constructing data pipelines that link data sources for analysis.
➠ A data model serves as a tool to implement these pipelines
and organize data across various sources.
➠ Data modelers are systems analysts who collaborate with data
architects and database administrators to design databases and
data systems.
Terminologies in Data Science

Data Visualization: | Terminologies in Data Science


➠ Data visualization is the process of presenting information and
data through charts, graphs, maps, and other visual aids.

➠ Effective data visualizations can enhance storytelling, make


data more accessible to a broader audience, reveal patterns
and relationships, and facilitate deeper exploration of the data.
Terminologies in Data Science

Data Wrangling: | Terminologies in Data Science


➠ Data wrangling, also known as data munging or data
remediation which involves transforming raw data into a usable
data format.
➠ The wrangling process consists of four stages:
✔ discovery,
✔ data transformation,
✔ data validation, and
✔ data publishing.

➠ The data transformation stage can be further divided into


tasks such as data structuring, normalization or
denormalization, cleaning, and enrichment.
Terminologies in Data Science

Data Warehouse: | Terminologies in Data Science


➠ A data warehouse is a centralized storage system that holds
processed and organized data from various sources.
➠ It may include a mix of current and historical data that has
been extracted, transformed, and loaded from both internal
and external databases.

More Jargon: | Data Science Jargon


✓ Data base, ✓ deep learning, ✓ machine learning,
✓ reinforcement learning, ✓ structured data, ✓ regression,
✓ structure query language (SQL), ✓ supervised learning,
✓ unsupervised learning, ✓ unstructured data
Modern Data Ecosystem

What is Modern Data Ecosystem?


✍ A modern data ecosystem refers to the integrated set of
technologies, practices, and processes that the organizations
use to collect, store, process, analyze, and visualize data.
✍ The term data ecosystem refers to programming language,
packages, algorithms, cloud-computing services, general
infrastructure that the organization uses to collect, store,
analyze, and leverage data – Harvard business school.
✍ This ecosystem is designed to handle the complexities of
today’s data landscape, which includes vast amounts of
structured and unstructured data generated from various
sources, including IoT devices, social media, transactional
systems, and more.
Modern Data Ecosystem

Some Key Elements of Modern Data Ecosystem:


1 Data Sources
2 Data Integration
3 Data Storage
4 Data Processing
5 Data Governance
6 Data Analytics Tools
7 Data Analysis Techniques
8 Data Visualization
9 Data Democratization
10 Advanced Data Analytics
Modern Data Ecosystem

Data Sources: | Elements of Modern Data Ecosystem


- these are the systems, applications, and devices that generates
or collect data for an organization.
- Data Sources can include ✔customer relationship management
(CRM), ✔web applications, ✔transactional databases, ✔social
media platform or Internet of Things (IOT) devices and more.

Data Integration: | Elements of Modern Data Ecosystem


- Data from different sources often needs to be integrated into a
unified format for analysis
- data integration tools and techniques are used to extract,
transform and load (ETL) data from diverse sources into a
centralized data repository or data lake.
Modern Data Ecosystem

Data Storage: Elements of Modern Data Ecosystem


- A modern data ecosystem typically involves storing data in
scalable and flexible formats.
- this can include traditional relational database (like postgreSQL
or MySQL), cloud based data warehouses (like Amazon
Redshift or Google BigQuery) or distributed file systems like
Apache Hadoop.

Data Processing: Elements of Modern Data Ecosystem


- to analyze large volumes of data efficiently, distributed
computing frameworks like ✔Apache Spark or ✔Apache
Hadoop MapReduce are commonly used.
- These frameworks enable parallel processing of data across
clusters of computers, allowing for high-performance data
processing and analytics.
Modern Data Ecosystem
Data Governance: Elements of Modern Data Ecosystem
- data governance ensures the availability, integrity, privacy, and
security of data within an organization.
- it involves establishing policies, processes, and controls to
manage data effectively, comply with regulations, and maintain
data quality.
- data governance frameworks and tools help organizations
ensure the reliability of their data analytics processes.

Data Analytics Tools: Elements of Modern Data Ecosystem


- a wide range of tools and technologies exist for data analytics,
including programming languages (python or R), statistical
packages, business intelligence (BI) tools, data visualization
tools (tableau or power BI), and machine learning platform.
- these tools enable organizations to extract insights, discover
patterns and make data-driven decisions.
Modern Data Ecosystem

Data Analysis Techniques: Modern Data Ecosystem Elements


- data analytics encompasses various techniques, including
✔ descriptive analytics (summarizing, historical data),
✔ predictive analytics (forecasting future outcomes), and
✔ prescriptive analytics (providing recommendations).
- Organizations employ these techniques to gain insights, detect
anomalies, predict trends, and optimize business processes.

Data Visualization: Elements of Modern Data Ecosystem


- data visualization is crucial for effectively communicating
insights and findings from data analysis.
- modern data ecosystems leverage interactive and intuitive
visualization tools to present data in meaningful ways, enabling
stakeholders to understand and interpret the results easily.
Modern Data Ecosystem

Data democratization: Elements of Modern Data Ecosystem


➠ data democratization aims to make data and analytics
accessible to a wider audience within an organization.
➠ this involves providing self-service ✔ analytics capabilities,
✔ intuitive dashboards, and ✔ tools that empower business
users and domain experts to explore and analyze data without
heavy reliance on IT or data science team.
Modern Data Ecosystem

Advanced Analytics: Elements of Modern Data Ecosystem


➠ advanced analytics techniques such as machine learning,
artificial intelligence (AI) are increasingly being incorporated
into modern data ecosystems.
➠ these techniques enable organizations to leverage complex
algorithms and models to uncover patterns, make predictions,
automate recommendation & decision-making processes, and
gain competitive advantage.
Data Science Life Cycle

Data Science Life Cycle

Fig. 1 Data Science Life Cycle


Data Science Life Cycle

Discovery: | Data Science Life Cycle


➤ Understand the problem you’re trying to solve and the business
objectives – understanding the business problem

➠ In this phase, you will gather requirements and define key


metrics for success.
✔ Identify the problem.
✔ Understand the business context and goals.
✔ Define the data science objectives and questions.
✔ Specify success/performance criteria (metrics).
✔ Formulate a hypothesis to be tested with data.
Data Science Life Cycle

Data Preparation: | Data Science Life Cycle


➤ Collect, clean, and organize relevant data needed to address
the problem.

✔ Data acquisition: from internal databases, external APIs,


web scraping, etc.
✔ Data cleaning: handle missing values, correct
inconsistencies, remove duplicates.
✔ Data transformation: standardization, normalization, and
encoding of variables.
✔ Data integration: from various sources.
✔ Exploratory Data Analysis(EDA): to understand data
distribution and relationships.
Data Science Life Cycle

Model Plan: | Data Science Life Cycle


➠ Design a plan for model building by selecting algorithms and
creating a workflow for model development.

✔ Choose appropriate modeling techniques based on the data


and problem (e.g., regression, classification, clustering).
✔ Split the dataset into training, validation, and test sets.
✔ Decide on performance metrics to evaluate the model (e.g.,
accuracy, precision, recall, RMSE).
✔ Create a roadmap for feature selection, model iteration, and
testing.
Data Science Life Cycle

Model Development: | Data Science Life Cycle


➠ Build and train the model on the prepared data.

✔ Develop models using the chosen algorithms.


✔ Train the model on the training data.
✔ Fine-tune hyperparameters to optimize performance.
✔ Use cross-validation to avoid overfitting.
✔ Feature engineering, if needed, to enhance model
performance.
Data Science Life Cycle

Operationalize/Deployment: | Data Science Life Cycle


➠ Deploy the model into production and integrate it into the
business environment where it can provide predictions.

✔ Implement the model in a live system or application.


✔ Set up pipelines for data flow, allowing real-time predictions
or periodic batch processing.
✔ Ensure scalability and performance in a real-world setting.
✔ Monitor the model’s behavior and response time.
Data Science Life Cycle

Communicate Results: | Data Science Life Cycle


➠ Present insights and results to stakeholders/audience to inform
decision-making or recommendation.

✔ Generate reports, dashboards, or visualizations that clearly


explain the model’s output and predictions.
✔ Provide actionable insights based on the model’s results.
✔ Communicate the business impact of the model’s findings.
✔ Suggest further iterations or improvements based on
feedback.
Trends, markets and applications of data science

Trends, Application and job Market: | Data Science


✔ Data science is revolutionizing industries with trends and
applications focused on automation, AI integration, and
enhanced decision-making.

Trends: | Data Science


➠ Artificial Intelligence and Machine Learning (AI/ML):
AI-powered models are advancing across fields like NLP
(Natural Language Processing), computer vision, and deep
learning, leading to automation in various sectors.
➠ Data Engineering and Real-time Analytics: As data grows,
real-time processing and data engineering are crucial for quick
insights, particularly in finance, e-commerce, and IoT.
Trends, markets and applications of data science
Trends, Application and job Market: | Data Science
✔ Data science is revolutionizing industries with trends and
applications focused on automation, AI integration, and
enhanced decision-making.

Trends: | Data Science


➠ MLOps and AI Governance: Managing machine learning
operations (MLOps) for model deployment, scaling, and
governance is on the rise to streamline workflows and ensure
compliance.
➠ Edge Computing: Processing data closer to the source is
enabling faster decision-making in IoT and autonomous
systems.
➠ Privacy-focused AI: With data privacy concerns, federated
learning and differential privacy are gaining attraction.
Trends, markets and applications of data science

Trends, Application and job Market: | Data Science


✔ Data science is revolutionizing industries with trends and
applications focused on automation, AI integration, and
enhanced decision-making.

Markets : | Data Science


➠ Healthcare: Predictive analytics, genomics, and personalized
medicine are transforming diagnostics and treatment plans.
➠ Finance and Banking: Risk assessment, fraud detection, and
algorithmic trading leverage data science for secure and
optimized financial services.
Trends, markets and applications of data science
Trends, Application and job Market: | Data Science
✔ Data science is revolutionizing industries with trends and
applications focused on automation, AI integration, and
enhanced decision-making.

Trends: | Data Science


➠ Retail and E-commerce: Customer behavior analysis,
demand forecasting, and personalized recommendations are
core to the retail sector’s data strategy.
➠ Manufacturing and Supply Chain: Predictive maintenance,
inventory management, and quality control benefit from
real-time analytics.
➠ Telecommunications: Improving network management,
churn prediction, and customer segmentation are key
applications.
Trends, markets and applications of data science

Trends, Application and job Market: | Data Science


✔ Data science is revolutionizing industries with trends and
applications focused on automation, AI integration, and
enhanced decision-making.

Applications: | Data Science


➠ Customer Segmentation and Personalization: Used in
marketing and e-commerce for targeted campaigns and
recommendations.
➠ Predictive Maintenance: Critical in manufacturing and
transportation to anticipate failures and reduce downtime.
Trends, markets and applications of data science

Trends, Application and job Market: | Data Science


✔ Data science is revolutionizing industries with trends and
applications focused on automation, AI integration, and
enhanced decision-making.

Applications: | Data Science


➠ Fraud Detection and Risk Analysis: Vital in finance, where
data science models detect anomalies and reduce risk.
➠ Healthcare Analytics: From disease prediction to operational
optimization, data science is transforming healthcare delivery.
➠ Natural Language Processing: NLP applications are
essential in virtual assistants, chatbots, and sentiment analysis
across sectors.
Trends, markets and applications of data science

Data Science Application and job Market: | Data Science


Tools and Technologies in Data Science

Tools and Technologies: | Data Science


➠ To make use of data,
Raw data should be passed through data science tasks:
1 Data Management
2 Data Integration and Transformation
3 Data Visualization
4 Model Building
5 Model Deployment
6 Model Monitoring and Assessment

To perform above tasks explained, one need following:


✔ data asset management, ✔ code asset management,
✔ execution environments, and ✔ development environments
Tools and Technologies in Data Science

Tools and Technologies: | Data Science


➠ Data asset management
- It refers to the systematic process of
✔ organizing, ✔storing,
✔ securing, and ✔ maintaining
data as a valuable asset for an organization.

- It involves the practices and tools needed to ensure that data


is ✔ accessible, ✔ high-quality, ✔ reliable, and ✔ effectively
leveraged to support decision-making, analytics, and business
strategies, recommendations.
Tools and Technologies in Data Science

Tools and Technologies: | Data Science


➠ Data Asset Management
- Here are some open source Data Asset Management tools:
Tools and Technologies in Data Science

Tools and Technologies: | Data Science


➠ Code asset management
- it involves ✔organizing, ✔tracking, and ✔maintaining the
code or scripts, and data models used in data projects.
- Effective code asset management ensures that the work is
✔ reproducible, ✔ version-controlled, and
✔ easy to collaborate on.
Tools and Technologies in Data Science

Tools and Technologies: | Data Science


➠ Code Asset Management
- Here are some open source Code Asset Management tools:
Tools and Technologies in Data Science

Tools and Technologies: | Data Science


➠Execution environments
- these are setups where code, models, and data workflows
➠ run, enabling experimentation, development, testing,
evaluation and production.

- Each environment type offers specific advantages depending


on the project’s needs, such as processing power, scalability,
and reproducibility.
Tools and Technologies in Data Science

Tools and Technologies: | Data Science


➠ Execution Environment
- Here are some open source Execution Environment:
Tools and Technologies in Data Science

Tools and Technologies: | Data Science


➠ Development environments
- these are platforms, tools, and setups used to
✔ write code,
✔ test code, and
✔ debug code for data projects.

- The right environment can


✔ streamline experimentation,
✔ improve productivity, and
✔ enhance collaboration.
Tools and Technologies in Data Science

Tools and Technologies: | Data Science


➠ Development Environment
- Here are some open source Development Environment:
Tools and Technologies in Data Science
Tools and Technologies: | Data Science
➠ Data Management
- it is the process of collecting, persisting, and retrieving data
securely, efficiently, and cost-effectively.
- Data is collected from many sources, like Twitter, Flipkart,
Media, Sensors, and more.
- Store collected data in persistent storage so it is available
whenever you need it.
Tools and Technologies in Data Science

Tools and Technologies: | Data Science


➠ Data Management
- Here are some open source data management tools:
Tools and Technologies in Data Science

Tools and Technologies: | Data Science


➠ Data Integration and Transformation
- It is the early process (on collected raw data) of Extracting,
Transforming, and Loading data. ➠ “ETL”.
- Some of this data is distributed in multiple repositories.
- For example, a database, a data cube, and flat files.
- Use the Extraction process to extract data from these
numerous repositories and save to a central repository like a
Data Warehouse.
- Data Warehouses are primarily used to collect and store
massive amounts of data for data analysis.
Tools and Technologies in Data Science
Tools and Technologies: | Data Science
➠ Data Integration and Transformation
- Next, Data Transformation is the process of transforming the
values, structure, and format of data.
- After extraction, the next step is to transform the data;
- And once the data is transformed, it’s time to load the data.
- Transformed data is loaded back to the Data Warehouse.
Tools and Technologies in Data Science

Tools and Technologies: | Data Science


➠ Data Integration and Transformation
- Here are some open source data Integration and
Transformation tools:
Tools and Technologies in Data Science

Tools and Technologies: | Data Science


➠ Data visualization
- It is the graphical representation of data and information.
- You can use visualization to represent data in the form of
charts, plots, maps, animations, etc.
- And data visualization conveys data more effectively for
decision-makers.
- It is a crucial step in the data science process.
Tools and Technologies in Data Science
Tools and Technologies: | Data Science
➠ Data visualization
- Various forms of data visualizations include:

a bar chart ✔ a treemap ✔


➠ which compares the size of ➠ which displays hierarchy
each component, data,
Tools and Technologies in Data Science

Tools and Technologies: | Data Science


➠ Data visualization
- Various forms of data visualizations include:
a line chart ✔ a map chart ✔
➠ which plots a series of data ➠ which displays data by
points over time location.
Tools and Technologies in Data Science

Tools and Technologies: | Data Science


➠ Data Visualization
- Here are some open source Data Visualization tools:
Tools and Technologies in Data Science

Tools and Technologies: | Data Science


➠ Model Building:
- This is where you train the data and analyze patterns with
machine learning algorithms.
- The system ‘learns’ how to provide predictions or decisions
by itself; you can then use this model to make predictions on
new, unseen data.
- Model building can be done using a service called IBM
Watson Machine Learning; it provides a full range of tools
and services for building models.

➠ Some other machine learning model building platform:


✔ Google Cloud AI Platform, ✔ Amazon SageMaker (AWS),
✔ Microsoft Azure Machine Learning, ✔ Databricks, BigML etc
Tools and Technologies in Data Science

Tools and Technologies: | Data Science


➠ Model Deployment:
- The process of integrating a developed model into a
production environment.
- In model deployment, a machine learning model is made
available to third-party applications via APIs.
- Business users can access and interact with the data through
these third-party applications.
- and, so this helps them make data-driven decisions.
Tools and Technologies in Data Science

Tools and Technologies: | Data Science


➠ Model Deployment
- Here are some open source Model Deployment tools:
Tools and Technologies in Data Science

Tools and Technologies: | Data Science


➠ Model monitoring and assessment
- It runs continuous quality checks to ensure a model’s
accuracy, fairness, and robustness.
- Model monitoring uses tools like Fiddler to track the
performance of deployed models in a production
environment.
- Now, model assessment uses evaluation metrics like the
✔F1-score, ✔ true positive rate, or ✔ the sum of
squared error to understand a model’s performance.
- A well-known example is the IBM Watson Open scale, which
continuously monitors deployed machine learning and deep
learning models.
- It will improve the accuracy and quality of your predictions.
Tools and Technologies in Data Science

Tools and Technologies: | Data Science


➠ Model monitoring and assessment
- Here are some open source Model monitoring and assessment
tools:
Tools and Technologies in Data Science

Cloud Based Tools and Technologies: | Data Science


➠ Fully Integrated Visual Tools and Platform
✔ Watson Studio and Watson OpenScale: It covers the
complete development life cycle for all data science,
machine learning, and artificial intelligence (AI) tasks.

✔ Microsoft Azure Machine Learning: It is also a full


cloud-hosted offering supporting the complete development
life cycle of all data science, machine learning, and AI tasks.

✔ H2O Driverless AI: Although it is a product you download


and install, there exists a one-click deployment for the
standard cloud service providers; This cloud provider does
not do operations and maintenance, as with Watson Studio,
Open Scale, and Azure Machine Learning,
Tools and Technologies in Data Science

Cloud Based Tools and Technologies: | Data Science


➠ Data Management
- software-as-a-service (SaaS) versions of existing open source
and commercial tools exist;
- The cloud provider operates the tool for you in the cloud;
- The cloud provider operates the product by backing up your
data and configuring and installing updates.
Tools and Technologies in Data Science

Cloud Based Tools and Technologies: | Data Science


➠ Data Management
✔ Amazon Web Services DynamoDB is a NoSQL database
(database as a service). It allows storage and retrieving data
in a key-value or a document store format. The most
prominent document data structure is JSON.

✔ Cloudant is another database as a service offering; but in


the background, it is based on the open-source Apache
CouchDB.

✔ IBM DB2 service provided by IBM; it is an example of a


commercial database made available as a SaaS offering in
the cloud, taking away operational tasks from the user.
Tools and Technologies in Data Science

Cloud Based Tools and Technologies: | Data Science


➠ Data Integration and Transformation
➠ Two commercial data integration tools widely used are:

✔ Informatica Cloud Data Integration, and


✔ IBM’s Data Refinery.

- Data Refinery is part of IBM Watson Studio; it allows


transforming large amounts of raw data into consumable,
quality information in a spreadsheet-like user interface.
Tools and Technologies in Data Science

Cloud Based Tools and Technologies: | Data Science


➠ Data Visualization

✔ Datameer:
- A smaller company offering a cloud-based data visualization
tool is Datameer.
✔ IBM Cognos Analytics
- It is the service (Business intelligence suite) IBM offers as a
cloud solution for data visualization.
- IBM Data Refinery also offers data exploration and
visualization functionality in Watson Studio
Tools and Technologies in Data Science

Cloud Based Tools and Technologies: | Data Science


➠ Model Building

✔ IBM Watson Machine Learning


- Watson Machine Learning can train and build models using
various open-source libraries.
✔ Google has a similar service on their cloud called AI
Platform Training.
- Every cloud provider has a solution for this task.
Tools and Technologies in Data Science

Cloud Based Tools and Technologies: | Data Science


➠ Model Deployment

✔ IBM Watson Machine Learning


- Watson Machine Learning deploys a model and makes it
available to consumers using a REST interface.
Tools and Technologies in Data Science

Cloud Based Tools and Technologies: | Data Science


➠ Model Monitoring and Assessment

✔ Amazon SageMaker Model Monitor


- Amazon SageMaker Model Monitor is an example of a cloud
tool to monitor deployed machine learning and deep learning
models continuously.

✔ Watson OpenScale.
- It is another tool for model monitoring
Data Scientist and their Roles

Data Scientist: | Data Science


✔ Data scientists are skilled professional who use a combination
of statistical, analytical, and machine learning techniques to
extract valuable insights from data.
- They play a key role in analyzing vast amounts of structured
and unstructured data to solve complex business problems,
drive decision-making, and forecast trends.

➠ Key Skills and Knowledge Areas:


✔ Statistics and Mathematics: Essential for creating predictive
models and understanding data patterns.
✔ Programming: Proficiency in languages like Python, R, and
SQL is common, as they enable data manipulation, analysis,
and model building.
Data Scientist and their Roles

Data Scientist: | Data Science


➠ Key Skills and Knowledge Areas:
✔ Machine Learning and AI: Familiarity with machine
learning algorithms (e.g., regression, classification,
clustering) and deep learning for building complex models.
✔ Data Visualization: Ability to present findings through
visualizations (using tools like Tableau, Power BI, or Python
libraries) that are understandable to non-technical
stakeholders.
✔ Domain Knowledge: Understanding the industry they work
in (e.g., finance, healthcare, or retail) helps data scientists
apply relevant models and solutions.
Data Scientist and their Roles

Role of Data Scientist: | Data Science


➠ Data Collection and Processing
- Role: Data scientists are responsible for identifying,
gathering, and cleaning data from various sources, ensuring
it’s reliable for analysis.

- Tasks: They preprocess and organize data using ETL


(Extract, Transform, Load) methods, which involve data
extraction, cleaning, transformation, and storage.
Data Scientist and their Roles

Role of Data Scientist: | Data Science


➠ Exploratory Data Analysis (EDA)
- Role: Conducting EDA helps data scientists uncover
patterns, trends, and anomalies in the data.
- Tasks: They use statistical and visualization techniques to
summarize the main characteristics of datasets and draw
preliminary insights.
Data Scientist and their Roles

Role of Data Scientist: | Data Science


➠ Feature Engineering
- Role: Data scientists create new features or modify existing
ones to enhance model accuracy and performance.
- Tasks: They select relevant features and apply techniques
such as scaling, encoding, and dimensionality reduction to
make data suitable for modeling.
Data Scientist and their Roles

Role of Data Scientist: | Data Science


➠ Model Building and Evaluation
- Role: Data scientists build, train, and fine-tune machine
learning models to address specific business questions.
- Tasks: They select the right algorithms, train models,
validate performance using cross-validation and metrics like
accuracy, precision, and recall, and optimize
hyperparameters.
Data Scientist and their Roles

Role of Data Scientist: | Data Science


➠ Machine Learning and Deep Learning
- Role: Advanced machine learning and deep learning skills
enable data scientists to tackle complex problems.
- Tasks: They implement algorithms for supervised,
unsupervised, and reinforcement learning, and may use deep
learning for tasks like image recognition or natural language
processing.
Data Scientist and their Roles

Role of Data Scientist: | Data Science


➠ Data Visualization and Reporting
- Role: Data scientists present data findings to stakeholders in
a clear, actionable format.
- Tasks: They create visualizations and dashboards using
tools like Tableau, Power BI, or Python libraries (e.g.,
Matplotlib, Seaborn) to make complex insights accessible to
non-technical audiences.
Data Scientist and their Roles

Role of Data Scientist: | Data Science


➠ Experimentation and Testing
- Role: Data scientists design experiments to test hypotheses,
new products, or features.
- Tasks: They set up and analyze controlled experiments
(e.g., A/B testing) to measure the impact of changes and
optimize decisions based on statistically valid results.
Data Scientist and their Roles

Role of Data Scientist: | Data Science


➠ Model Deployment and Monitoring
- Role: Once models are developed, data scientists often help
deploy them in production environments. Tasks: They work
with data engineers and MLOps teams to monitor model
performance over time, ensuring continued accuracy and
detecting issues like data drift.
Data Scientist and their Roles

Role of Data Scientist: | Data Science


➠ Collaboration and Communication
- Role: Data scientists work closely with cross-functional
teams to align data solutions with business goals.
- Tasks: They communicate insights and results to teams and
executives, translating technical findings into business
strategies and actionable insights.
Data Scientist and their Roles

Role of Data Scientist: | Data Science


➠ Ethics and Data Privacy
- Role: Data scientists ensure their work adheres to ethical
guidelines and data privacy laws, especially when handling
personal data.
- Tasks: They address issues like data security, bias in models,
and regulatory compliance, particularly in fields like finance
and healthcare where data privacy is crucial.
Module Assignment – As You Go

Module#1 Assignment is available at MS-Team.

Submission Deadline: 7th December 2024 (Before 3:00 PM)

You might also like