The document provides an overview of data science, including its purpose, processes, and tools used for data collection, analysis, and visualization. It discusses various types of data, big data characteristics, and the analytic process model, along with applications in fields like medical imaging and risk modeling. Additionally, it covers machine learning methods, data quality issues, and visualization techniques to effectively communicate insights derived from data.
Ds Final
'~' = tips

CHAPTER 1
1. Data science
Data science is about the extraction, preparation, analysis, visualisation and maintenance of information, using scientific methods and processes to draw insights from data.
2. Data
A collection of factual information based on numbers, words, observations and measurements, which can be utilised for calculation, discussion and reasoning.
- Structured data ~ formatted, highly organised, easily searchable and understandable by machine language. E.g. names, addresses - RDBMS, CRM, ERP.
- Unstructured data ~ unformatted, unorganised, cannot be processed and analysed by conventional methods and gadgets. E.g. text, audio - NoSQL databases.
3. Big data ~
A collection of data from various sources, often characterised by the 6 Vs, together with the extraction, analysis and management of a large volume of data.
- Volume: the amount of data from myriad sources
- Variety: the types of data (structured, semi-structured, unstructured)
- Velocity: the speed at which big data is generated
- Veracity: the degree to which big data can be trusted
- Value: the business value of the data collected
- Variability: the ways in which big data can be used and formatted
4. Data science process ~
Data exploration -> modelling (utilise ML algorithms) -> model testing (check precision and other qualities of the model) -> model deployment
5. Data Science & Big Data

Factors        | Big data                                          | Data Science
Concept        | Handling large data                               | Analysing data
Responsibility | Process huge volumes of data                      | Understand patterns within data using machine learning algorithms
Industry       | E-commerce, security services, telecommunication  | Sales, image recognition, advertisement, risk analytics
Tools          | Hadoop, Spark, Flink                              | SAS, R, Python

6. Purpose of data science
- Find patterns within data
- Draw insights from the data
- Make predictions
- Derive conclusions from data
7. Data analytics ~
Descriptive Analytics
- Based on live data; tells what's happening in real time
- Accurate and handy for operations management
- Easy to visualise
- E.g. a monthly profit and loss statement
Diagnostic Analytics
- Automated root cause analysis
- Explains "why" things are happening
- Helps troubleshoot issues
- E.g. isolate the root cause of a problem
Predictive Analytics
- Tells what's likely to happen
- Based on historical data; assumes a static business plan/model
- Helps business decisions to be automated using algorithms
- E.g. the older a person, the more susceptible they are to a heart attack, so we could say that age has a linear correlation with heart-attack risk; these data are then compiled together into a score or prediction
Prescriptive Analytics
- Defines future actions, i.e. "What to do next?"
- Based on current data analytics; generates data and insights, and makes predefined future plans, goals, decisions and objectives
- E.g. producing an exam timetable such that no students have clashing schedules
8. Analytic process model
Identify the business problem -> identify data sources -> select the data -> clean the data -> transform the data -> analyse the data -> interpret, evaluate and deploy the model
9. Related software tools
SAS, Spark, BigML, MATLAB, Excel, ggplot2
10. Data science applications ~
- Medical Image Analysis: object detection; data science techniques can pinpoint and outline specific structures or anomalies within medical images, such as detecting and segmenting tumours in MRI scans.
- Drug Discovery: drug repurposing; data science techniques can analyse existing drug data to identify new uses for existing medications, which can be faster and less costly than developing a new drug from scratch.
- Risk Modelling Analysis: helps the banking industry formulate new strategies for assessing performance, and allows banks to analyse how their loans will be repaid in credit risk modelling.
- Customer Segmentation: classification and clustering to determine potential customers and to segment customers by their common behaviours, such as identifying customers by their profitability in banking institutions.
- Recommendation Engines: suggest offers and extended services based on customer transactions and personal information, and estimate which products the customer may be interested in buying by analysing historical purchases.

CHAPTER 2
1. Data Preparation ~
- Turning the available data into a dataset: reading and cleansing the data.
- By using normalisation, we can handle outliers.
- Handling errors in values, e.g. age cannot be 120.
- Handling inconsistency in values, e.g. some records use "Female" and some use "F".
2. Data Exploration ~
Explore to understand what data is in the dataset and what relationships are hidden within the data.
3. Data Representation ~
- How data is stored in the computer; involves assigning specific data structures to the variables involved.
- Completes the transformation of the raw data into a structured dataset or model that can be easily interpreted and analysed.
- E.g. use tables, matrices, arrays or networks.
4. Data Discovery ~
- Discover insights and patterns through the dataset.
- This involves conducting hypothesis testing, correlation analysis or other analytical techniques.
5. Learning from Data ~
- Crucial stage: build predictive models or statistical algorithms using machine learning techniques.
- Involves selecting the appropriate modelling approach, training the models on the prepared data and evaluating their performance using suitable metrics.
6. Creating a Data Product ~
- Develop data-driven solutions that leverage the insights and models generated from the data.
- Integrate the data analysis and modelling results into practical applications such as recommendation systems, forecasting tools or decision support systems.
7. Insight, Deliverance and Visualization ~
- Communicate the findings, insights and data analysis effectively to the stakeholders.
- Create visualisations, reports and presentations to present the outcomes in a clear and understandable manner.

CHAPTER 3
1. Histogram ~
Displays frequency data using bars.
2. Box plots ~
Display the distribution of data based on the minimum, first quartile, median, third quartile and maximum.
3. Pie Chart ~
Displays data, information and statistics in an easy-to-read "pie-slice" format, with varying slice sizes telling you how much of one data element exists.
4. Scatter plot ~
Displays and analyses the relationship between two continuous variables to reveal patterns, correlations and anomalies in the data.
5. Bar Chart ~
Represents and compares categorical data, where each bar corresponds to the value or frequency of the category it represents.
6. Data visualisation
- Conveys information through visual representation to communicate data clearly, discover relationships and create fun and interesting graphics.
- Functions of visualisations: record or store information; analyse and support reasoning about information; communicate or convey information to others.

CHAPTER 4
1. Machine learning
A set of methods that can automatically detect patterns in data, and then use the uncovered patterns to predict future data or to perform other kinds of decision making.
Supervised ~
- The data are labelled with pre-defined classes.
- A supervised learning algorithm uses a training set to teach models to yield the desired output.
- This training dataset includes inputs and correct outputs, which allow the model to learn over time.
- The algorithm measures its accuracy through the loss function, adjusting until the error has been sufficiently minimised.
Unsupervised ~
- Class labels of the data are unknown; used to establish the existence of classes or clusters in the data.
- An unsupervised learning algorithm analyses and aggregates unlabelled datasets.
- Without human intervention, these algorithms uncover hidden patterns or clusters of data.
2. Use of machine learning
Evaluate the information content of data, provide useful insight, and find useful patterns in data without the need for any domain knowledge.
3. R platform
- An integrated suite of software facilities for data manipulation, calculation and graphical display.
- Advantages: open source, data wrangling, platform independent, continuously growing.
- Disadvantages: weak origin, data handling, complicated language.
7. Descriptive analytics
Association Rules
- Detect frequently occurring patterns between items.
- E.g. detecting which products are frequently purchased together in a supermarket context.
- Support(X u Y) = number of transactions supporting (X u Y) / total number of transactions. ~ Basically the count of transactions that contain all the items we want, divided by the whole dataset.
- Confidence, C = number of transactions supporting (X u Y) / number of transactions with X. ~ Basically the count of transactions that contain all the items we want, divided by all those that include the left side of the rule.
Sequence Rules
- Detect sequences of events.
- E.g. detect sequences of purchase behaviour in a supermarket context.
Clustering
- Detect homogeneous segments of observations.
- E.g. segment the customer population for targeted marketing.
8. K-means
Partitions data points into disjoint clusters around centroids so as to minimise the sum-of-squares criterion. [Partitions the data into K distinct, non-overlapping subsets (clusters); the goal is to minimise the variance within each cluster.]
Steps:
- Select k observations as initial cluster centroids
- Assign each observation to the cluster that has the closest centroid
- Recalculate the positions of the k centroids
- Repeat until convergence
Problems: sensitive to the initial points; k must be chosen manually.
9. Mean shift
Does not require specifying the number of clusters in advance. It works by finding the densest areas of data points and shifting the centroid towards the mean of the data points in that region.
Steps:
- Choose a search window
- Compute the mean of the data in the search window
- Centre the search window at the new mean location
- Repeat until convergence
10. Clustering
Detect homogeneous segments of observations.
- Divisive (top-down) hierarchical clustering: starts with the whole data set in one cluster (the root), then breaks it up into ever smaller clusters until one observation per cluster remains (right to left).
- Agglomerative (bottom-up) clustering: starts with each observation in its own cluster and continues to merge the clusters that are most similar until all observations make up one big cluster (left to right).

CHAPTER 5
1. Assumptions
Data arrive in a rapid stream or streams and are not immediately stored, so they will be lost if not processed.
a. Examples: sensor data (ocean behaviour), image data (satellites), Internet (IP packets), web traffic (search queries)
2. Stream queries
a. Standing queries: permanently executing and producing output at appropriate times
b. Ad-hoc queries: a question asked once
c. Issues in processing: real-time, executed in main memory; a large number of streams together can exceed the amount of available main memory, requiring the invention of new techniques
3. Hash function
Used to map data of arbitrary size to data of fixed size. The values returned are called hash values.
4. The Bloom filter
A memory-efficient and fast probabilistic data structure that tells whether an item is definitely not in the set or maybe in the set. It is also used to reduce I/O operations and increase performance.
5. The count-distinct problem
a. Problem: how many different elements have appeared in the stream?
b. Solution: count from the beginning of the stream, keeping the elements in a hash table or search tree so that new elements can be added and each arriving element can be checked against those already seen.
c. Use secondary memory: many tests and updates can be performed on the data in a block, or the number of distinct elements can be estimated.
6. Flajolet-Martin Algorithm
Used to approximate the number of distinct elements in a single stream with a single pass.
a. Input message + hash function (e.g. SHA-1) = hash value
b. Example: input stream 1, 3, 2, 1, 2, 3, 4, 3, 1, 2, 3, 1 with h(x) = (6x + 1) mod 5; h(1) = (6 + 1) mod 5 = 7 mod 5 = 2. Continue for the rest of the numbers.
c. Calculate the binary form: convert each hash value to binary
d. Find the trailing zeros: count how many zeros follow the last 1 in the binary value
e. Distinct-element estimate: find the maximum number of trailing zeros r and compute 2^r
7. Space requirements
a. One integer per hash function: records the largest tail seen
b. To process only one stream: use millions of hash functions
c. To process many streams at the same time: main memory constrains the number of hash functions
8. Estimating moments
A generalisation of the problem of counting distinct elements in a stream: moments describe the distribution of frequencies of the different elements in the stream.
a. m = surprise number (how uneven the distribution of elements in the stream is)
b. n = length of the stream
c. Alon-Matias-Szegedy algorithm. Stream: a, b, c, b, d, a, c, d, a, b, d, c, a, a, b, with n = 15; a appears 5 times, b 4 times, c 3 times, d 3 times. The second moment for the stream is 5^2 + 4^2 + 3^2 + 3^2 = 59.
9. Counting ones in a window
Using the Datar-Gionis-Indyk-Motwani (DGIM) algorithm:
a. The right side of a bucket should always start with a 1, and every bucket should contain at least one 1, else no bucket can be formed; all bucket sizes should be powers of 2; bucket sizes cannot decrease as we move to the left.

CHAPTER 6
1. PageRank
A function that assigns a real number to each page in the web.
E.g. A = 1/3 at other pages, B = 1/2 at A and D, C = 1 at A, D = 1/2 at B and C.
2. Topic-sensitive PageRank
- Certain topics are weighted more heavily; surfers prefer to land on a page that is known to cover the chosen topic. It classifies users according to the degree of their interest in each of the selected topics.
- Steps: decide on the topics; pick the teleport set to compute a topic-sensitive PageRank vector for each topic; find a way to determine the topic or topics that are most relevant for a particular search query; use the PageRank vectors in the ordering of the responses to search queries.
3. Link Spam
Methods used by spammers designed to fool the PageRank algorithm into overvaluing certain pages (increasing their PageRank).
4. Spam Farm
A collection of pages built to increase PageRank. Webpages:
- Inaccessible pages: the spammer cannot affect them
- Accessible pages: not controlled by the spammer, but can be affected by them
- Own pages: owned and controlled by the spammer
5. Combating link spam
- Look for a page that links to a large number of pages, each of which links back to it.
- TrustRank, a variation of topic-sensitive PageRank, to lower the score.
- Spam mass, a calculation that identifies spam pages and eliminates them to lower their PageRank.
6. TrustRank
- Let humans examine web pages and decide whether they are trustworthy.
- Pick domains whose membership is controlled, such as .edu.
7. Spam Mass
- Measures the fraction of a page's PageRank that comes from spam.
- Negative or small spam mass = not spam; spam mass close to 1 = spam.
8. Hubs and authorities (HITS: hyperlink-induced topic search)
- Authorities: pages that are valuable because they provide information about a topic.
- Hubs: pages that are valuable not because they provide information, but because they tell you where to find out about the topic.
- The computation deals with an iterative computation of a fixed point involving repeated matrix-vector multiplication.

CHAPTER 7
1. Data quality
Defined as fitness for use.
2. Multidimensional concept of data quality
Each dimension represents a single aspect or construct of data items and comprises both objective and subjective aspects.
- Intrinsic: believability, objectivity, reputation
- Contextual: value-added, completeness, relevancy, appropriate amount of data
- Representational: interpretability, ease of understanding
- Accessibility: accessibility, security
3. Causes of data quality problems
Multiple data sources (duplicates), subjective judgement (bias), limited computing facilities (limited data access), size of data (high response time).
4. Creating and maintaining data quality
- Establish consistent metadata across different systems and be consistent with data entry before integration.
- Schedule regular information audits (reviews).
5. Benchmarking
- Compare the output and performance of an analytical model with a reference model.
- Benchmark (challenger): find the weaknesses of the current analytical model (champion) and beat it; repeat the process to perfect the current model.
6. Privacy
Issues caused by illegal collection of data, and by data subjects having no say in how the collected data are used.
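The normalisation and outlier handling described in Chapter 2's Data Preparation step can be sketched as follows. This is a minimal illustration using only the standard library; the sample ages, the 1.5 z-score threshold and the function names are illustrative assumptions, not from the notes.

```python
import statistics

def min_max(values):
    """Min-max normalisation: rescale values into the range [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def z_scores(values):
    """Standard scores: how many standard deviations each value sits from the mean."""
    mu = statistics.mean(values)
    sigma = statistics.stdev(values)
    return [(v - mu) / sigma for v in values]

# Illustrative data: an age of 120 is the kind of error the notes mention.
ages = [21, 25, 30, 34, 120]
scaled = min_max(ages)                           # 120 maps to 1.0, the rest cluster low
flags = [abs(z) > 1.5 for z in z_scores(ages)]   # crude outlier flag (threshold assumed)
```

In practice the threshold (and whether to drop, cap or correct flagged values) is a judgment call made during data preparation.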
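The support and confidence formulas for association rules in Chapter 4 can be computed directly from a list of transactions. A minimal sketch; the basket data and function names are illustrative.

```python
def support(transactions, itemset):
    """Fraction of all transactions that contain every item in `itemset`."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def confidence(transactions, lhs, rhs):
    """Of the transactions containing the left side, the fraction that also contain the right side."""
    lhs_hits = sum(1 for t in transactions if lhs <= t)
    both = sum(1 for t in transactions if (lhs | rhs) <= t)
    return both / lhs_hits

# Illustrative supermarket baskets.
baskets = [{"bread", "milk"}, {"bread", "butter"},
           {"milk", "butter"}, {"bread", "milk", "butter"}]
s = support(baskets, {"bread", "milk"})        # 2 of 4 baskets -> 0.5
c = confidence(baskets, {"bread"}, {"milk"})   # 2 of the 3 bread baskets -> 2/3
```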
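The four k-means steps listed in Chapter 4 (pick k centroids, assign, recalculate, repeat) can be sketched as below. This is Lloyd's algorithm in plain Python; the sample points and the fixed seed are illustrative assumptions.

```python
import random

def kmeans(points, k, iters=100, seed=42):
    """K-means on tuples: the four steps from the notes."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)                 # 1. select k initial centroids
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:                              # 2. assign to the closest centroid
            dists = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
            clusters[dists.index(min(dists))].append(p)
        new_centroids = []                            # 3. recalculate the centroids
        for i, cl in enumerate(clusters):
            if cl:
                new_centroids.append(tuple(sum(xs) / len(cl) for xs in zip(*cl)))
            else:
                new_centroids.append(centroids[i])    # keep an empty cluster's centroid
        if new_centroids == centroids:                # 4. repeat until convergence
            break
        centroids = new_centroids
    return centroids, clusters

# Two well-separated blobs; the algorithm should split them 3 / 3.
pts = [(1, 1), (1.5, 2), (1, 0.5), (8, 8), (9, 9), (8.5, 7)]
centroids, clusters = kmeans(pts, k=2)
```

The "sensitive to initial points" problem from the notes is visible here: a different seed changes the starting centroids, and on harder data can change the final clusters.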
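The mean-shift steps in Chapter 4 (choose a window, compute the mean inside it, re-centre, repeat until convergence) can be sketched in one dimension. The flat window, radius and data are illustrative assumptions; real implementations typically use a kernel-weighted mean.

```python
def mean_shift_mode(points, start, radius=2.0, tol=1e-6, max_iter=100):
    """Shift a search window towards the mean of the points inside it."""
    centre = start                                     # step 1: choose a search window
    for _ in range(max_iter):
        inside = [p for p in points if abs(p - centre) <= radius]
        new_centre = sum(inside) / len(inside)         # step 2: mean inside the window
        if abs(new_centre - centre) < tol:             # step 4: stop at convergence
            return new_centre
        centre = new_centre                            # step 3: re-centre the window
    return centre

# Two dense regions near 1.0 and 5.0; starting at 0.5 converges to the first.
data = [1.0, 1.2, 0.8, 5.0, 5.2, 4.8]
mode = mean_shift_mode(data, start=0.5)
```

Note that, unlike k-means, no cluster count is supplied: each starting point simply drifts to the nearest dense region.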
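Agglomerative (bottom-up) clustering from Chapter 4 can be sketched by repeatedly merging the two closest clusters. This is a single-linkage variant on 1-D values, stopped at k clusters rather than run to one big cluster; the data and stopping rule are illustrative assumptions.

```python
def agglomerative(points, k):
    """Start with singleton clusters; merge the closest pair until k remain."""
    clusters = [[p] for p in points]                  # every observation starts alone
    while len(clusters) > k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # single linkage: distance between the closest members
                d = min(abs(a - b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] += clusters.pop(j)                # merge the closest pair
    return clusters

groups = agglomerative([1.0, 1.1, 5.0, 5.2, 9.0], k=2)
```

Running the loop all the way to one cluster, and recording each merge, yields the full dendrogram the notes describe (left to right); divisive clustering traverses the same tree in the other direction.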
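The Bloom filter from Chapter 5 can be sketched with a plain bit list and k hash positions derived from SHA-1 (the notes mention SHA-1 for hashing; the bit-array size, k and class name here are illustrative assumptions).

```python
import hashlib

class BloomFilter:
    def __init__(self, m=256, k=3):
        self.m, self.k = m, k
        self.bits = [False] * m          # the m-bit array, initially all zeros

    def _positions(self, item):
        # Derive k positions by salting the item and hashing with SHA-1.
        for i in range(self.k):
            digest = hashlib.sha1(f"{i}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.m

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = True

    def might_contain(self, item):
        # False means "definitely not in the set"; True means "maybe in the set".
        return all(self.bits[pos] for pos in self._positions(item))

bf = BloomFilter()
bf.add("alice")
bf.add("bob")
bf.might_contain("carol")   # almost always False here, but a false positive is possible
```

This is why the notes say a Bloom filter reduces I/O: a cheap in-memory check filters out most items that are definitely absent before any disk lookup.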
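The Flajolet-Martin worked example in Chapter 5 (stream 1, 3, 2, 1, 2, 3, 4, 3, 1, 2, 3, 1 with h(x) = (6x + 1) mod 5) can be reproduced directly. One assumption is flagged in the code: a hash value of 0 is treated as having no trailing zeros, a common classroom simplification.

```python
def trailing_zeros(n):
    """Zeros after the last 1 bit; 0 is treated as r = 0 (classroom simplification)."""
    if n == 0:
        return 0
    r = 0
    while n % 2 == 0:
        n //= 2
        r += 1
    return r

def flajolet_martin(stream, h):
    """Estimate distinct elements as 2^r, where r is the max trailing-zero count."""
    max_r = max(trailing_zeros(h(x)) for x in stream)
    return 2 ** max_r

stream = [1, 3, 2, 1, 2, 3, 4, 3, 1, 2, 3, 1]
estimate = flajolet_martin(stream, lambda x: (6 * x + 1) % 5)
# h(3) = 4 = 100 in binary has the most trailing zeros (r = 2), so the
# estimate is 2^2 = 4, which matches the true distinct count {1, 2, 3, 4}.
```

The "space requirements" item in the notes applies here: each hash function needs only one stored integer, the largest r seen so far.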
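The Alon-Matias-Szegedy example in Chapter 5 can be checked in code: the stream a, b, c, b, d, a, c, d, a, b, d, c, a, a, b has second moment 5^2 + 4^2 + 3^2 + 3^2 = 59. The sketch below computes the exact moment and one AMS estimator variable; averaging the estimator over every start position recovers the moment exactly, which illustrates why the estimator is unbiased.

```python
from collections import Counter

stream = list("abcbdacdabdcaab")    # n = 15, as in the notes

# Exact second moment: the sum of squared element frequencies.
f2 = sum(c * c for c in Counter(stream).values())    # 25 + 16 + 9 + 9 = 59

def ams_estimate(stream, t):
    """One AMS variable: element at position t, counted from t to the end."""
    n = len(stream)
    el = stream[t]
    val = stream[t:].count(el)
    return n * (2 * val - 1)

# E[n(2c - 1)] equals F2, so averaging over all start positions gives 59 exactly.
avg = sum(ams_estimate(stream, t) for t in range(len(stream))) / len(stream)
```

In the streaming setting one cannot average over all positions; instead a few random positions are sampled and their estimates averaged, trading memory for accuracy.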
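The PageRank function of Chapter 6 is usually computed by power iteration with a teleport term. A minimal sketch; the three-page web, the damping factor 0.85 and the function name are illustrative assumptions, not from the notes.

```python
def pagerank(links, beta=0.85, iters=50):
    """Power iteration: `links` maps each page to its list of out-links."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1 / n for p in pages}                 # start from the uniform vector
    for _ in range(iters):
        new = {p: (1 - beta) / n for p in pages}     # teleport (taxation) share
        for p, outs in links.items():
            for q in outs:                           # p spreads its rank over its out-links
                new[q] += beta * rank[p] / len(outs)
        rank = new
    return rank

# Illustrative web: A -> B, C; B -> C; C -> A.  C collects links from both A and B.
web = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = pagerank(web)
```

Topic-sensitive PageRank, as described in the notes, changes only the teleport step: instead of spreading (1 - beta)/n over all pages, it is spread over the teleport set for the chosen topic.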
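The HITS computation from Chapter 6, the repeated matrix-vector iteration over hub and authority scores, can be sketched without explicit matrices. The tiny web (one hub "H" pointing at two content pages) and the max-normalisation are illustrative assumptions.

```python
def hits(links, iters=50):
    """Alternate authority and hub updates, normalising by the maximum score."""
    pages = list(links)
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iters):
        # authority of p = sum of hub scores of the pages that link to p
        auth = {p: sum(hub[q] for q in pages if p in links[q]) for p in pages}
        norm = max(auth.values())
        auth = {p: v / norm for p, v in auth.items()}
        # hub of p = sum of authority scores of the pages p links to
        hub = {p: sum(auth[q] for q in links[p]) for p in pages}
        norm = max(hub.values())
        hub = {p: v / norm for p, v in hub.items()}
    return hub, auth

# "H" tells you where to find the topic pages; P1 and P2 hold the content.
web = {"H": ["P1", "P2"], "P1": [], "P2": []}
hub, auth = hits(web)
```

The result matches the notes' definitions: H scores as a pure hub, while P1 and P2 score as authorities.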