Ds Final

The document provides an overview of data science, including its purpose, processes, and tools used for data collection, analysis, and visualization. It discusses various types of data, big data characteristics, and the analytic process model, along with applications in fields like medical imaging and risk modeling. Additionally, it covers machine learning methods, data quality issues, and visualization techniques to effectively communicate insights derived from data.


Note: '~' = tips.

CHAPTER 1
1. Data science
Data science is about the extraction, preparation, analysis, visualisation and maintenance of information; it uses scientific methods and processes to draw insights from data.
2. Data
A collection of factual information based on numbers, words, observations and measurements which can be utilised for calculation, discussion and reasoning.
- Structured data ~
Formatted, highly organised, easily searchable and understandable by machine language. E.g. name, address - RDBMS, CRM, ERP.
- Unstructured data ~
Unformatted, unorganised, cannot be processed and analysed with conventional methods and tools. E.g. text, audio - NoSQL databases.
3. Big data ~
A collection of data from various sources, often characterised by the 6 Vs, together with the extraction, analysis and management of a large volume of data.
- Volume – the amount of data from myriad sources
- Variety – the types of data: structured, semi-structured, unstructured
- Velocity – the speed at which big data is generated
- Veracity – the degree to which big data can be trusted
- Value – the business value of the data collected
- Variability – the ways in which big data can be used and formatted
4. Data science process ~
Data exploration -> modelling (utilise ML algorithms) -> model testing (check precision and other qualities of the model) -> model deployment
5. Data Science & Big Data
Factors        | Big data                                         | Data science
Concept        | Handling large data                              | Analysing data
Responsibility | Process huge volumes of data                     | Understand patterns within data and generate insights
Industry       | E-commerce, security services, telecommunication | Sales, image recognition, advertisement, risk analytics
Tools          | Hadoop, Spark, Flink                             | SAS, R, Python
6. Purpose of data science
- Find patterns within data
- Draw insights from the data
- Make predictions
- Derive conclusions from data
7. Data analytics ~
- Descriptive Analytics
o Based on live data; tells what is happening in real time
o Accurate and handy for operations management
o Easy to visualise
o E.g. a monthly profit and loss statement
- Diagnostic Analytics
o Automated root cause analysis
o Explains "why" things are happening
o Helps troubleshoot issues
o E.g. isolate the root cause of a problem
- Predictive Analytics
o Tells what is likely to happen
o Based on historical data, and assumes a static business plan/model
o Helps business decisions to be automated using algorithms
o E.g. the older a person, the more susceptible they are to a heart attack; we could say that age has a linear correlation with heart-attack risk. These data are then compiled into a score or prediction
- Prescriptive Analytics
o Defines future actions, i.e. "What to do next?"
o Based on current data analytics, makes predefined future plans, goals, decisions and objectives
o E.g. producing an exam timetable such that no students have clashing schedules
8. Analytic process model
Identify the business problem -> identify data sources -> select the data -> clean the data -> transform the data -> analyse the data -> interpret, evaluate and deploy the model
9. Related software tools
SAS, Spark, BigML, MATLAB, Excel, ggplot2
10. Data science applications ~
- Medical Image Analysis – object detection; data science techniques can pinpoint and outline specific structures or anomalies within medical images, such as detecting and segmenting tumours in MRI scans
- Drug Discovery – drug repurposing; data science techniques can analyse existing drug data to identify new uses for existing medications, which can be faster and less costly than developing a new drug from scratch
- Risk Modelling Analysis – helps the banking industry formulate new strategies for assessing performance, and allows banks to analyse how their loans will be repaid in credit risk modelling
- Customer Segmentation – classification and clustering to determine potential customers and to segment customers based on their common behaviours, such as identifying customers by their profitability in banking institutions
- Recommendation Engines – suggest offers and extended services based on customer transactions and personal information, and estimate what products the customer may be interested in buying by analysing historical purchases

CHAPTER 2
1. Data Preparation ~
- Turning the available data into a dataset
- Reading and cleansing the data
- By using normalisation, we can handle outliers
- Handling errors in values: e.g. age cannot be 120
- Handling inconsistency in values: e.g. some records use "Female" and some use "F"
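A minimal sketch of the three fixes above (error values, inconsistent labels, normalisation by scaling), assuming pandas; the toy records are hypothetical:

```python
import pandas as pd

# Hypothetical records showing the problems named above.
df = pd.DataFrame({
    "age": [34.0, 120.0, 28.0, 45.0],
    "gender": ["Female", "F", "male", "M"],
})

# Handle errors in values: treat impossible ages as missing.
df.loc[~df["age"].between(0, 100), "age"] = float("nan")

# Handle inconsistency in values: map every spelling to one label.
df["gender"] = df["gender"].str.upper().str[0].map({"F": "Female", "M": "Male"})

# Handle outliers via normalisation (min-max scaling to [0, 1]).
df["age_scaled"] = (df["age"] - df["age"].min()) / (df["age"].max() - df["age"].min())
print(df)
```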
2. Data Exploration ~
Explore to understand what data is in the dataset and what relationships are hidden within the data.
3. Data Representation ~
- How data is stored in the computer; involves assigning specific data structures to the variables involved
- Completes the transformation of the raw data into a structured dataset or model that can be easily interpreted and analysed
- E.g. use tables, matrices, arrays or networks
4. Data Discovery ~
- Discover insights and patterns in the dataset
- This involves conducting hypothesis testing, correlation analysis or other analytical techniques
5. Learning from Data ~
- Crucial stage
- Build predictive models or statistical algorithms using machine learning techniques
- Involves selecting the appropriate modelling approach, training the models on the prepared data and evaluating their performance using suitable metrics
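A minimal sketch of this select/train/evaluate loop, assuming scikit-learn; the synthetic dataset is a hypothetical stand-in for the prepared data:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Hypothetical prepared data: 500 labelled observations with 8 features.
X, y = make_classification(n_samples=500, n_features=8, random_state=0)

# Select a modelling approach, train it on one split, and evaluate
# its performance on a held-out split with a suitable metric.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("accuracy:", accuracy_score(y_test, model.predict(X_test)))
```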
6. Creating a Data Product ~
- Develop data-driven solutions that leverage the insights and models generated from the data
- Integrate the data analysis and modelling results into practical applications such as recommendation systems, forecasting tools or decision support systems
7. Insight, Deliverance and Visualization ~
- Communicate the findings, insights and data analysis effectively to the stakeholders
- Create visualisations, reports and presentations to present the outcomes in a clear and understandable manner
CHAPTER 3
1. Histogram ~
Displays frequency data using bars.
2. Box plots ~
Display the distribution of data based on the minimum, first quartile, median, third quartile and maximum.
3. Pie Chart ~
Displays data, information and statistics in an easy-to-read 'pie-slice' format, with varying slice sizes telling you how much of one data element exists.
4. Scatter plot ~
Displays and analyses the relationship between two continuous variables to reveal patterns, correlations and anomalies in the data.
5. Bar Chart ~
Represents and compares categorical data, where each bar corresponds to the value or frequency of the category it represents.
6. Data visualisation
Conveys information through visual representation to communicate data clearly, discover relationships and create fun and interesting graphics.
- Functions of visualisations: record or store information, analyse and support reasoning about information, and communicate or convey information to others
7. Descriptive analytics
- Association Rules (a worked sketch follows after this list)
o Detect frequently occurring patterns between items
o E.g. detecting what products are frequently purchased together in a supermarket context
o Support(X u Y) = number of transactions supporting (X u Y) / total number of transactions
~ basically: transactions that contain everything we want (both sides of the rule), divided by the whole dataset
o Confidence, C = number of transactions supporting (X u Y) / number of transactions with X
~ basically: transactions that contain everything we want (both sides of the rule), divided by all that include the left side of the rule
- Sequence Rules
o Detect sequences of events
o E.g. detect sequences of purchase behaviour in a supermarket context
- Clustering
o Detect homogeneous segments of observations
o E.g. segment the customer population for targeted marketing
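A worked sketch of the support and confidence formulas above, on a hypothetical list of supermarket transactions:

```python
# Hypothetical transactions; the rule under test is {bread} -> {milk}.
transactions = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk"},
]

X, Y = {"bread"}, {"milk"}
n_xy = sum(1 for t in transactions if X | Y <= t)  # transactions supporting (X u Y)
n_x = sum(1 for t in transactions if X <= t)       # transactions with X

support = n_xy / len(transactions)   # 2 / 4 = 0.50
confidence = n_xy / n_x              # 2 / 3 = 0.67
print(f"support={support:.2f} confidence={confidence:.2f}")
```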
8. K-means
- Partitions data points into disjoint clusters around centroids so as to minimise the sum-of-squares criterion [partitions data into K distinct, non-overlapping subsets (clusters); the goal is to minimise the variance within each cluster]
- Steps:
o Select k observations as initial cluster centroids
o Assign each observation to the cluster that has the closest centroid
o Recalculate the position of the k centroids
o Repeat the assign and recalculate steps until the centroids stabilise
- Problems – sensitive to initial points; must manually choose k
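A minimal sketch of those steps, assuming NumPy; the 2-D observations are hypothetical random data:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))  # hypothetical 2-D observations
k = 3

# Step 1: select k observations as the initial cluster centroids.
centroids = X[rng.choice(len(X), size=k, replace=False)]

for _ in range(100):
    # Step 2: assign each observation to the closest centroid.
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    labels = dists.argmin(axis=1)
    # Step 3: recalculate the k centroids (assumes no cluster goes empty).
    new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
    if np.allclose(new_centroids, centroids):  # centroids stabilised
        break
    centroids = new_centroids

print(centroids)
```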
9. Mean shift
- Does not require specifying the number of clusters in advance. It works by finding the densest areas of data points and shifting the centroid towards the mean of the data points in that region.
- Steps:
o Choose a search window
o Compute the mean of the data in the search window
o Centre the search window at the new mean location
o Repeat until convergence
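A minimal 1-D sketch of those windowed steps, assuming NumPy; the data and window radius are hypothetical:

```python
import numpy as np

# Hypothetical 1-D data with two dense regions (near 1.0 and near 5.0).
data = np.array([1.0, 1.2, 1.1, 0.9, 5.0, 5.2, 4.9, 5.1])

center, radius = 0.0, 1.5  # choose a search window
for _ in range(100):
    window = data[np.abs(data - center) <= radius]  # points inside the window
    new_center = window.mean()                      # mean of data in the window
    if abs(new_center - center) < 1e-6:             # repeat until convergence
        break
    center = new_center                             # re-centre at the new mean

print(center)  # settles on the dense region near 1.0
```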
10. Clustering
- Detects homogeneous segments of observations
- Divisive (top-down) hierarchical clustering: starts with all data points in one cluster, the root, and then breaks this up into ever smaller clusters until one observation per cluster remains (right to left)
- Agglomerative (bottom-up) clustering: starts from each observation in its own cluster and continues to merge the ones that are most similar until all observations make up one big cluster (left to right)

CHAPTER 4
1. Machine learning
A set of methods that can automatically detect patterns in data, and then use the uncovered patterns to predict future data or to perform other kinds of decision making.
- Supervised ~
o The data are labelled with predefined classes
o A supervised learning algorithm uses a training set to teach models to yield the desired output
o This training dataset includes inputs and correct outputs, which allow the model to learn over time
o The algorithm measures its accuracy through the loss function, adjusting until the error has been sufficiently minimised
- Unsupervised ~
o Class labels of the data are unknown; used to establish the existence of classes or clusters in the data
o An unsupervised learning algorithm analyses and aggregates unlabelled datasets using machine learning algorithms
o Without human intervention, these algorithms uncover hidden patterns or clusters of data
2. Use of machine learning
Evaluate the information content of data, provide useful insight, and find useful patterns in data without the need for any domain knowledge.
3. R platform
An integrated suite of software facilities for data manipulation, calculation and graphical display.
- Advantages: open source, data wrangling, platform independent, continuously growing
- Disadvantages: weak origin, data handling, complicated language

CHAPTER 5
1. Assumptions
Data arrive rapidly in a stream or streams, are not immediately stored, and will be lost.
a. Examples: sensor data (ocean behaviour), image data (satellites), Internet (IP packets), Web traffic (search queries)
2. Stream queries
a. Standing queries: permanently executing and producing output at appropriate times
b. Ad-hoc queries: a question asked once
c. Issues in processing: real-time processing executed in main memory; a large number of streams together can exceed the amount of available main memory and require the invention of new techniques
3. Hash function
Used to map data of arbitrary size to data of fixed size. The values returned are called hash values.
4. The Bloom filter
A memory-efficient and fast probabilistic data structure that tells whether an item is definitely not in the set or maybe in the set. It is also used to reduce I/O operations and increase performance.
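A minimal sketch of a Bloom filter with a bit array; the array size and the two hash positions derived from one SHA-1 digest are illustrative choices, not a prescribed design:

```python
import hashlib

M = 64            # hypothetical bit-array size
bits = [0] * M

def positions(item: str):
    # Two hash positions carved out of one digest (illustrative scheme).
    d = hashlib.sha1(item.encode()).digest()
    return [int.from_bytes(d[:4], "big") % M,
            int.from_bytes(d[4:8], "big") % M]

def add(item: str):
    for p in positions(item):
        bits[p] = 1

def query(item: str) -> str:
    # Any bit clear -> definitely not in the set; all set -> maybe in the set.
    return "maybe in set" if all(bits[p] for p in positions(item)) else "not in set"

add("alice")
print(query("alice"))    # maybe in set
print(query("mallory"))  # almost certainly: not in set
```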
5. The count-distinct problem
a. Problem: how many different elements have appeared in the stream?
b. Solution: count from the beginning of the stream, keeping elements in a hash table or search tree so that new elements can be added and an arriving element can be checked against those already seen.
c. Use secondary memory: many tests and updates can be performed on the data in a block, or estimate the number of distinct elements instead.
6. Flajolet-Martin Algorithm
Used to approximate the number of distinct elements in a stream with a single pass.
a. Input message + hash function (e.g. SHA-1) = hash value
b. Example:
Input stream: 1,3,2,1,2,3,4,3,1,2,3,1
h(x) = 6x + 1 mod 5
h(1) = 6 + 1 mod 5 = 7 mod 5 = 2
Continue for the rest of the numbers.
c. Calculate the binary bits = convert each hash value to binary
d. Find the trailing zeros = count the zeros at the end of the binary value (after the last 1)
e. Distinct elements: find the maximum number of trailing zeros R and estimate 2^R
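A minimal sketch of the worked example above, using the same hash h(x) = 6x + 1 mod 5:

```python
stream = [1, 3, 2, 1, 2, 3, 4, 3, 1, 2, 3, 1]

def trailing_zeros(v: int) -> int:
    # Zeros after the last 1 in the binary representation (0 for v == 0).
    return (v & -v).bit_length() - 1 if v else 0

R = 0
for x in stream:
    h = (6 * x + 1) % 5            # hash value, e.g. h(1) = 2 (binary 10)
    R = max(R, trailing_zeros(h))  # keep the largest tail seen

print(2 ** R)  # estimate of the distinct count; here 2^2 = 4, and the
               # stream really does contain 4 distinct elements
```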
7. Space requirements
a. One integer per hash function: record the largest tail seen
b. Processing only one stream: millions of hash functions can be used
c. Processing many streams at the same time: main memory constrains the number of hash functions
8. Estimating moments
A generalisation of the problem of counting distinct elements in a stream: moments describe the distribution of the frequencies of the different elements in the stream.
a. m = surprise number (how uneven the distribution of elements in the stream is)
b. n = length of the stream
c. Alon-Matias-Szegedy (AMS) algorithm
Stream: a,b,c,b,d,a,c,d,a,b,d,c,a,a,b with n = 15
a: 5 times, b: 4 times, c: 3 times, d: 3 times
Second moment for the stream: 5² + 4² + 3² + 3² = 59
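The second-moment arithmetic above, spelled out on the same stream:

```python
from collections import Counter

# Second moment (surprise number): the sum of the squared frequencies
# of the distinct elements in the stream.
stream = "a b c b d a c d a b d c a a b".split()
counts = Counter(stream)                        # a:5, b:4, c:3, d:3
second_moment = sum(f ** 2 for f in counts.values())
print(second_moment)                            # 5^2 + 4^2 + 3^2 + 3^2 = 59
```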
7. Space requirements A= 1/3 at other pages,B=1/2 at A and D ,C=1 at  Measures fraction of its PageRank & no say in how the data collected is
a. One integer per hash function: A,D=1/2 at B and C that comes from spam used
records the largest tail seen  Negative/ small spam mass = not
b. Process only one stream: use spam
millions of hash function  Spam mass close to 1 = spam
CHAPTER 6
1. PageRank
A function that assigns a real number to each page in the web. The computation deals with the iterative computation of a fixed point involving repeated matrix-vector multiplication.
E.g. A = 1/3 at each of the other pages, B = 1/2 at A and D, C = 1 at A, D = 1/2 at B and C (each page divides its rank equally among the pages it links to).
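A minimal power-iteration sketch of that repeated matrix-vector multiplication, assuming NumPy and one concrete reading of the example's links (A -> B, C, D; B -> A, D; C -> A; D -> B, C, an assumption since the example only gives the fractions); the teleport refinement is omitted:

```python
import numpy as np

# Column j holds the fractions page j passes to the pages it links to.
M = np.array([
    [0,   1/2, 1, 0  ],   # into A: 1/2 from B, 1 from C
    [1/3, 0,   0, 1/2],   # into B: 1/3 from A, 1/2 from D
    [1/3, 0,   0, 1/2],   # into C: 1/3 from A, 1/2 from D
    [1/3, 1/2, 0, 0  ],   # into D: 1/3 from A, 1/2 from B
])

v = np.full(4, 1 / 4)          # start from a uniform rank vector
for _ in range(100):
    nxt = M @ v                # one matrix-vector multiplication
    if np.allclose(nxt, v):    # fixed point reached
        break
    v = nxt

print(dict(zip("ABCD", v.round(3))))
```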
2. Topic-sensitive PageRank
Certain pages are weighted more heavily because of their topic. Surfers prefer to land on a page that is known to cover the chosen topic, so it classifies users according to the degree of their interest in each of the selected topics.
- Steps: decide on the topics; pick the teleport set to compute the topic-sensitive PageRank vector for each topic; find a way to determine the topic or set of topics most relevant for a particular search query; use the PageRank vectors in the ordering of the responses to the search queries.
3. Link Spam
Methods used by spammers designed to fool the PageRank algorithm into overvaluing certain pages (increasing their PageRank).
4. Spam Farm
A collection of pages built to increase PageRank. Webpages:
- Inaccessible pages: the spammer cannot affect them
- Accessible pages: not controlled by the spammer, but can be affected by them
- Own pages: owned and controlled by the spammer
5. Combating link spam
- Look for a page that links to a large number of pages, each of which links back to it
- TrustRank, a variation of topic-sensitive PageRank, to lower the score; spam mass, a calculation that identifies spam pages so they can be eliminated to lower their PageRank
6. TrustRank
- Let humans examine web pages and decide whether they are trustworthy
- Pick domains whose membership is controlled, such as .edu
7. Spam Mass
- Measures the fraction of a page's PageRank that comes from spam
- Negative or small spam mass = not spam
- Spam mass close to 1 = spam
8. Hubs and authorities (HITS: hyperlink-induced topic search)
- Authorities: pages that are valuable because they provide information about a topic
- Hubs: pages that are valuable not because they provide information, but because they tell you where to find out about the topic

CHAPTER 7
1. Data quality
Defined as fitness for use.
2. Multidimensional concept of data quality
Each dimension represents a single aspect or construct of data items and comprises both objective and subjective aspects.
- Intrinsic: believability, objectivity, reputation
- Contextual: value-added, completeness, relevancy, appropriate amount of data
- Representational: interpretability, ease of understanding
- Accessibility: accessibility, security
3. Data quality problem causes
- Multiple data sources (duplicates), subjective judgement (bias), limited computing facilities (limited data access), size of data (high response time)
4. Creating and maintaining data quality
- Establish consistent metadata across different systems and be consistent with data entry before integration
- Schedule regular information audits (reviews)
5. Benchmarking
Compare the output and performance of an analytical model with a reference model.
- Benchmark (challenger): find weaknesses of the current analytical model (champion) and beat it; repeat the process to perfect the current model.
6. Privacy
- Issue causes: illegal collection of data, and data subjects having no say in how the collected data is used