The document provides an overview of data science, including its purpose, processes, and tools used for data collection, analysis, and visualization. It discusses various types of data, big data characteristics, and the analytic process model, along with applications in fields like medical imaging and risk modeling. Additionally, it covers machine learning methods, data quality issues, and visualization techniques to effectively communicate insights derived from data.
Ds Final
'~' = tips

CHAPTER 1
1. Data science
Data science is about the extraction, preparation, analysis, visualisation and maintenance of information, using scientific methods and processes to draw insights from data.
2. Data
A collection of factual information based on numbers, words, observations and measurements, which can be utilised for calculation, discussion and reasoning.
- Structured data ~ formatted, highly organised, easily searchable and understandable by machine language. E.g. names, addresses - RDBMS, CRM, ERP.
- Unstructured data ~ unformatted, unorganised, cannot be processed and analysed by conventional methods and gadgets. E.g. text, audio - NoSQL databases.
3. Big data ~
A collection of data from various sources, often characterised by the 6 Vs, together with the extraction, analysis and management of a large volume of data.
- Volume: the amount of data from myriad sources
- Variety: the types of data (structured, semi-structured, unstructured)
- Velocity: the speed at which big data is generated
- Veracity: the degree to which big data can be trusted
- Value: the business value of the data collected
- Variability: the ways in which big data can be used and formatted
4. Data science process ~
Data exploration -> modelling (utilise ML algorithms) -> model testing (check precision and other qualities of the model) -> model deployment
5. Data Science & Big Data

Factors        | Big data                                          | Data Science
Concept        | Handling large data                               | Analysing data
Responsibility | Process huge volumes of data                      | Understand patterns within data using machine learning algorithms
Industry       | E-commerce, security services, telecommunication  | Sales, image recognition, advertisement, risk analytics
Tools          | Hadoop, Spark, Flink                              | SAS, R, Python

6. Purpose of data science
- Find patterns within data
- Draw insights from the data
- Make predictions
- Derive conclusions from data
7. Data analytics ~
Descriptive Analytics
- Based on live data; tells what's happening in real time
- Accurate and handy for operations management
- Easy to visualise
- E.g. a monthly profit and loss statement
Diagnostic Analytics
- Automated root cause analysis
- Explains "why" things are happening
- Helps troubleshoot issues
- E.g. isolate the root cause of a problem
Predictive Analytics
- Tells what's likely to happen
- Based on historical data; assumes a static business plan/model
- Helps business decisions to be automated using algorithms
- E.g. the older a person, the more susceptible they are to a heart attack, so we could say that age has a linear correlation with heart-attack risk; these data are then compiled together into a score or prediction
Prescriptive Analytics
- Defines future actions, i.e. "What to do next?"
- Based on current data analytics; generates data and insights, and makes predefined future plans, goals, decisions and objectives
- E.g. producing an exam timetable such that no students have clashing schedules
8. Analytic process model
Identify the business problem -> identify data sources -> select the data -> clean the data -> transform the data -> analyse the data -> interpret, evaluate and deploy the model
9. Related software tools
SAS, Spark, BigML, MATLAB, Excel, ggplot2
10. Data science applications ~
- Medical Image Analysis: object detection; data science techniques can pinpoint and outline specific structures or anomalies within medical images, such as detecting and segmenting tumours in MRI scans.
- Drug Discovery: drug repurposing; data science techniques can analyse existing drug data to identify new uses for existing medications, which can be faster and less costly than developing a new drug from scratch.
- Risk Modelling Analysis: helps the banking industry formulate new strategies for assessing performance, and allows banks to analyse how their loans will be repaid in credit risk modelling.
- Customer Segmentation: classification and clustering to determine potential customers and to segment customers by their common behaviours, such as identifying customers by their profitability in banking institutions.
- Recommendation Engines: suggest offers and extended services based on customer transactions and personal information, and estimate which products the customer may be interested in buying by analysing historical purchases.

CHAPTER 2
1. Data Preparation ~
- Turning the available data into a dataset: reading and cleansing the data.
- By using normalisation, we can handle outliers.
- Handling errors in values, e.g. age cannot be 120.
- Handling inconsistency in values, e.g. some records use "Female" and some use "F".
2. Data Exploration ~
Explore to understand what data is in the dataset and what relationships are hidden within the data.
3. Data Representation ~
- How data is stored in the computer; involves assigning specific data structures to the variables involved.
- Completes the transformation of the raw data into a structured dataset or model that can be easily interpreted and analysed.
- E.g. use tables, matrices, arrays or networks.
4. Data Discovery ~
- Discover insights and patterns through the dataset.
- This involves conducting hypothesis testing, correlation analysis or other analytical techniques.
5. Learning from Data ~
- Crucial stage: build predictive models or statistical algorithms using machine learning techniques.
- Involves selecting the appropriate modelling approach, training the models on the prepared data and evaluating their performance using suitable metrics.
6. Creating a Data Product ~
- Develop data-driven solutions that leverage the insights and models generated from the data.
- Integrate the data analysis and modelling results into practical applications such as recommendation systems, forecasting tools or decision support systems.
7. Insight, Deliverance and Visualization ~
- Communicate the findings, insights and data analysis effectively to the stakeholders.
- Create visualisations, reports and presentations to present the outcomes in a clear and understandable manner.

CHAPTER 3
1. Histogram ~
Displays frequency data using bars.
2. Box plots ~
Display the distribution of data based on the minimum, first quartile, median, third quartile and maximum.
3. Pie Chart ~
Displays data, information and statistics in an easy-to-read "pie-slice" format, with varying slice sizes telling you how much of one data element exists.
4. Scatter plot ~
Displays and analyses the relationship between two continuous variables to reveal patterns, correlations and anomalies in the data.
5. Bar Chart ~
Represents and compares categorical data, where each bar corresponds to the value or frequency of the category it represents.
6. Data visualisation
- Conveys information through visual representation to communicate data clearly, discover relationships and create fun and interesting graphics.
- Functions of visualisations: record or store information; analyse and support reasoning about information; communicate or convey information to others.

CHAPTER 4
1. Machine learning
A set of methods that can automatically detect patterns in data, and then use the uncovered patterns to predict future data or to perform other kinds of decision making.
Supervised ~
- The data are labelled with pre-defined classes.
- A supervised learning algorithm uses a training set to teach models to yield the desired output.
- This training dataset includes inputs and correct outputs, which allow the model to learn over time.
- The algorithm measures its accuracy through the loss function, adjusting until the error has been sufficiently minimised.
Unsupervised ~
- Class labels of the data are unknown; used to establish the existence of classes or clusters in the data.
- An unsupervised learning algorithm analyses and aggregates unlabelled datasets.
- Without human intervention, these algorithms uncover hidden patterns or clusters of data.
2. Use of machine learning
Evaluate the information content of data, provide useful insight, and find useful patterns in data without the need for any domain knowledge.
3. R platform
- An integrated suite of software facilities for data manipulation, calculation and graphical display.
- Advantages: open source, data wrangling, platform independent, continuously growing.
- Disadvantages: weak origin, data handling, complicated language.
7. Descriptive analytics
Association Rules
- Detect frequently occurring patterns between items.
- E.g. detecting which products are frequently purchased together in a supermarket context.
- Support(X u Y) = number of transactions supporting (X u Y) / total number of transactions. ~ Basically the count of transactions that contain all the items we want, divided by the whole dataset.
- Confidence, C = number of transactions supporting (X u Y) / number of transactions with X. ~ Basically the count of transactions that contain all the items we want, divided by all those that include the left side of the rule.
Sequence Rules
- Detect sequences of events.
- E.g. detect sequences of purchase behaviour in a supermarket context.
Clustering
- Detect homogeneous segments of observations.
- E.g. segment the customer population for targeted marketing.
8. K-means
Partitions data points into disjoint clusters around centroids so as to minimise the sum-of-squares criterion. [Partitions the data into K distinct, non-overlapping subsets (clusters); the goal is to minimise the variance within each cluster.]
Steps:
- Select k observations as initial cluster centroids
- Assign each observation to the cluster that has the closest centroid
- Recalculate the positions of the k centroids
- Repeat until convergence
Problems: sensitive to the initial points; k must be chosen manually.
9. Mean shift
Does not require specifying the number of clusters in advance. It works by finding the densest areas of data points and shifting the centroid towards the mean of the data points in that region.
Steps:
- Choose a search window
- Compute the mean of the data in the search window
- Centre the search window at the new mean location
- Repeat until convergence
10. Clustering
Detect homogeneous segments of observations.
- Divisive (top-down) hierarchical clustering: starts with the whole data set in one cluster (the root), then breaks it up into ever smaller clusters until one observation per cluster remains (right to left).
- Agglomerative (bottom-up) clustering: starts with each observation in its own cluster and continues to merge the clusters that are most similar until all observations make up one big cluster (left to right).

CHAPTER 5
1. Assumptions
Data arrive in a rapid stream or streams and are not immediately stored, so they will be lost if not processed.
a. Examples: sensor data (ocean behaviour), image data (satellites), Internet (IP packets), web traffic (search queries)
2. Stream queries
a. Standing queries: permanently executing and producing output at appropriate times
b. Ad-hoc queries: a question asked once
c. Issues in processing: real-time, executed in main memory; a large number of streams together can exceed the amount of available main memory, requiring the invention of new techniques
3. Hash function
Used to map data of arbitrary size to data of fixed size. The values returned are called hash values.
4. The Bloom filter
A memory-efficient and fast probabilistic data structure that tells whether an item is definitely not in the set or maybe in the set. It is also used to reduce I/O operations and increase performance.
5. The count-distinct problem
a. Problem: how many different elements have appeared in the stream?
b. Solution: count from the beginning of the stream, keeping the elements in a hash table or search tree so that new elements can be added and each arriving element can be checked against those already seen.
c. Use secondary memory: many tests and updates can be performed on the data in a block, or the number of distinct elements can be estimated.
6. Flajolet-Martin Algorithm
Used to approximate the number of distinct elements in a single stream with a single pass.
a. Input message + hash function (e.g. SHA-1) = hash value
b. Example: input stream 1, 3, 2, 1, 2, 3, 4, 3, 1, 2, 3, 1 with h(x) = (6x + 1) mod 5; h(1) = (6 + 1) mod 5 = 7 mod 5 = 2. Continue for the rest of the numbers.
c. Calculate the binary form: convert each hash value to binary
d. Find the trailing zeros: count how many zeros follow the last 1 in the binary value
e. Distinct-element estimate: find the maximum number of trailing zeros r and compute 2^r
7. Space requirements
a. One integer per hash function: records the largest tail seen
b. To process only one stream: use millions of hash functions
c. To process many streams at the same time: main memory constrains the number of hash functions
8. Estimating moments
A generalisation of the problem of counting distinct elements in a stream: moments describe the distribution of frequencies of the different elements in the stream.
a. m = surprise number (how uneven the distribution of elements in the stream is)
b. n = length of the stream
c. Alon-Matias-Szegedy algorithm. Stream: a, b, c, b, d, a, c, d, a, b, d, c, a, a, b, with n = 15; a appears 5 times, b 4 times, c 3 times, d 3 times. The second moment for the stream is 5^2 + 4^2 + 3^2 + 3^2 = 59.
9. Counting ones in a window
Using the Datar-Gionis-Indyk-Motwani (DGIM) algorithm:
a. The right side of a bucket should always start with a 1, and every bucket should contain at least one 1, else no bucket can be formed; all bucket sizes should be powers of 2; bucket sizes cannot decrease as we move to the left.

CHAPTER 6
1. PageRank
A function that assigns a real number to each page in the web.
E.g. A = 1/3 at other pages, B = 1/2 at A and D, C = 1 at A, D = 1/2 at B and C.
2. Topic-sensitive PageRank
- Certain topics are weighted more heavily; surfers prefer to land on a page that is known to cover the chosen topic. It classifies users according to the degree of their interest in each of the selected topics.
- Steps: decide on the topics; pick the teleport set to compute a topic-sensitive PageRank vector for each topic; find a way to determine the topic or topics that are most relevant for a particular search query; use the PageRank vectors in the ordering of the responses to search queries.
3. Link Spam
Methods used by spammers designed to fool the PageRank algorithm into overvaluing certain pages (increasing their PageRank).
4. Spam Farm
A collection of pages built to increase PageRank. Webpages:
- Inaccessible pages: the spammer cannot affect them
- Accessible pages: not controlled by the spammer, but can be affected by them
- Own pages: owned and controlled by the spammer
5. Combating link spam
- Look for a page that links to a large number of pages, each of which links back to it.
- TrustRank, a variation of topic-sensitive PageRank, to lower the score.
- Spam mass, a calculation that identifies spam pages and eliminates them to lower their PageRank.
6. TrustRank
- Let humans examine web pages and decide whether they are trustworthy.
- Pick domains whose membership is controlled, such as .edu.
7. Spam Mass
- Measures the fraction of a page's PageRank that comes from spam.
- Negative or small spam mass = not spam; spam mass close to 1 = spam.
8. Hubs and authorities (HITS: hyperlink-induced topic search)
- Authorities: pages that are valuable because they provide information about a topic.
- Hubs: pages that are valuable not because they provide information, but because they tell you where to find out about the topic.
- The computation deals with an iterative computation of a fixed point involving repeated matrix-vector multiplication.

CHAPTER 7
1. Data quality
Defined as fitness for use.
2. Multidimensional concept of data quality
Each dimension represents a single aspect or construct of data items and comprises both objective and subjective aspects.
- Intrinsic: believability, objectivity, reputation
- Contextual: value-added, completeness, relevancy, appropriate amount of data
- Representational: interpretability, ease of understanding
- Accessibility: accessibility, security
3. Causes of data quality problems
Multiple data sources (duplicates), subjective judgement (bias), limited computing facilities (limited data access), size of data (high response time).
4. Creating and maintaining data quality
- Establish consistent metadata across different systems and be consistent with data entry before integration.
- Schedule regular information audits (reviews).
5. Benchmarking
- Compare the output and performance of an analytical model with a reference model.
- Benchmark (challenger): find the weaknesses of the current analytical model (champion) and beat it; repeat the process to perfect the current model.
6. Privacy
Issues caused by illegal collection of data, and by data subjects having no say in how the collected data are used.
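The normalisation and outlier handling described in Chapter 2's Data Preparation step can be sketched as follows. This is a minimal illustration using only the standard library; the sample ages, the 1.5 z-score threshold and the function names are illustrative assumptions, not from the notes.

```python
import statistics

def min_max(values):
    """Min-max normalisation: rescale values into the range [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def z_scores(values):
    """Standard scores: how many standard deviations each value sits from the mean."""
    mu = statistics.mean(values)
    sigma = statistics.stdev(values)
    return [(v - mu) / sigma for v in values]

# Illustrative data: an age of 120 is the kind of error the notes mention.
ages = [21, 25, 30, 34, 120]
scaled = min_max(ages)                           # 120 maps to 1.0, the rest cluster low
flags = [abs(z) > 1.5 for z in z_scores(ages)]   # crude outlier flag (threshold assumed)
```

In practice the threshold (and whether to drop, cap or correct flagged values) is a judgment call made during data preparation.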
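The support and confidence formulas for association rules in Chapter 4 can be computed directly from a list of transactions. A minimal sketch; the basket data and function names are illustrative.

```python
def support(transactions, itemset):
    """Fraction of all transactions that contain every item in `itemset`."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def confidence(transactions, lhs, rhs):
    """Of the transactions containing the left side, the fraction that also contain the right side."""
    lhs_hits = sum(1 for t in transactions if lhs <= t)
    both = sum(1 for t in transactions if (lhs | rhs) <= t)
    return both / lhs_hits

# Illustrative supermarket baskets.
baskets = [{"bread", "milk"}, {"bread", "butter"},
           {"milk", "butter"}, {"bread", "milk", "butter"}]
s = support(baskets, {"bread", "milk"})        # 2 of 4 baskets -> 0.5
c = confidence(baskets, {"bread"}, {"milk"})   # 2 of the 3 bread baskets -> 2/3
```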
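The four k-means steps listed in Chapter 4 (pick k centroids, assign, recalculate, repeat) can be sketched as below. This is Lloyd's algorithm in plain Python; the sample points and the fixed seed are illustrative assumptions.

```python
import random

def kmeans(points, k, iters=100, seed=42):
    """K-means on tuples: the four steps from the notes."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)                 # 1. select k initial centroids
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:                              # 2. assign to the closest centroid
            dists = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
            clusters[dists.index(min(dists))].append(p)
        new_centroids = []                            # 3. recalculate the centroids
        for i, cl in enumerate(clusters):
            if cl:
                new_centroids.append(tuple(sum(xs) / len(cl) for xs in zip(*cl)))
            else:
                new_centroids.append(centroids[i])    # keep an empty cluster's centroid
        if new_centroids == centroids:                # 4. repeat until convergence
            break
        centroids = new_centroids
    return centroids, clusters

# Two well-separated blobs; the algorithm should split them 3 / 3.
pts = [(1, 1), (1.5, 2), (1, 0.5), (8, 8), (9, 9), (8.5, 7)]
centroids, clusters = kmeans(pts, k=2)
```

The "sensitive to initial points" problem from the notes is visible here: a different seed changes the starting centroids, and on harder data can change the final clusters.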
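The mean-shift steps in Chapter 4 (choose a window, compute the mean inside it, re-centre, repeat until convergence) can be sketched in one dimension. The flat window, radius and data are illustrative assumptions; real implementations typically use a kernel-weighted mean.

```python
def mean_shift_mode(points, start, radius=2.0, tol=1e-6, max_iter=100):
    """Shift a search window towards the mean of the points inside it."""
    centre = start                                     # step 1: choose a search window
    for _ in range(max_iter):
        inside = [p for p in points if abs(p - centre) <= radius]
        new_centre = sum(inside) / len(inside)         # step 2: mean inside the window
        if abs(new_centre - centre) < tol:             # step 4: stop at convergence
            return new_centre
        centre = new_centre                            # step 3: re-centre the window
    return centre

# Two dense regions near 1.0 and 5.0; starting at 0.5 converges to the first.
data = [1.0, 1.2, 0.8, 5.0, 5.2, 4.8]
mode = mean_shift_mode(data, start=0.5)
```

Note that, unlike k-means, no cluster count is supplied: each starting point simply drifts to the nearest dense region.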
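Agglomerative (bottom-up) clustering from Chapter 4 can be sketched by repeatedly merging the two closest clusters. This is a single-linkage variant on 1-D values, stopped at k clusters rather than run to one big cluster; the data and stopping rule are illustrative assumptions.

```python
def agglomerative(points, k):
    """Start with singleton clusters; merge the closest pair until k remain."""
    clusters = [[p] for p in points]                  # every observation starts alone
    while len(clusters) > k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # single linkage: distance between the closest members
                d = min(abs(a - b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] += clusters.pop(j)                # merge the closest pair
    return clusters

groups = agglomerative([1.0, 1.1, 5.0, 5.2, 9.0], k=2)
```

Running the loop all the way to one cluster, and recording each merge, yields the full dendrogram the notes describe (left to right); divisive clustering traverses the same tree in the other direction.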
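The Bloom filter from Chapter 5 can be sketched with a plain bit list and k hash positions derived from SHA-1 (the notes mention SHA-1 for hashing; the bit-array size, k and class name here are illustrative assumptions).

```python
import hashlib

class BloomFilter:
    def __init__(self, m=256, k=3):
        self.m, self.k = m, k
        self.bits = [False] * m          # the m-bit array, initially all zeros

    def _positions(self, item):
        # Derive k positions by salting the item and hashing with SHA-1.
        for i in range(self.k):
            digest = hashlib.sha1(f"{i}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.m

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = True

    def might_contain(self, item):
        # False means "definitely not in the set"; True means "maybe in the set".
        return all(self.bits[pos] for pos in self._positions(item))

bf = BloomFilter()
bf.add("alice")
bf.add("bob")
bf.might_contain("carol")   # almost always False here, but a false positive is possible
```

This is why the notes say a Bloom filter reduces I/O: a cheap in-memory check filters out most items that are definitely absent before any disk lookup.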
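The Flajolet-Martin worked example in Chapter 5 (stream 1, 3, 2, 1, 2, 3, 4, 3, 1, 2, 3, 1 with h(x) = (6x + 1) mod 5) can be reproduced directly. One assumption is flagged in the code: a hash value of 0 is treated as having no trailing zeros, a common classroom simplification.

```python
def trailing_zeros(n):
    """Zeros after the last 1 bit; 0 is treated as r = 0 (classroom simplification)."""
    if n == 0:
        return 0
    r = 0
    while n % 2 == 0:
        n //= 2
        r += 1
    return r

def flajolet_martin(stream, h):
    """Estimate distinct elements as 2^r, where r is the max trailing-zero count."""
    max_r = max(trailing_zeros(h(x)) for x in stream)
    return 2 ** max_r

stream = [1, 3, 2, 1, 2, 3, 4, 3, 1, 2, 3, 1]
estimate = flajolet_martin(stream, lambda x: (6 * x + 1) % 5)
# h(3) = 4 = 100 in binary has the most trailing zeros (r = 2), so the
# estimate is 2^2 = 4, which matches the true distinct count {1, 2, 3, 4}.
```

The "space requirements" item in the notes applies here: each hash function needs only one stored integer, the largest r seen so far.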
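The Alon-Matias-Szegedy example in Chapter 5 can be checked in code: the stream a, b, c, b, d, a, c, d, a, b, d, c, a, a, b has second moment 5^2 + 4^2 + 3^2 + 3^2 = 59. The sketch below computes the exact moment and one AMS estimator variable; averaging the estimator over every start position recovers the moment exactly, which illustrates why the estimator is unbiased.

```python
from collections import Counter

stream = list("abcbdacdabdcaab")    # n = 15, as in the notes

# Exact second moment: the sum of squared element frequencies.
f2 = sum(c * c for c in Counter(stream).values())    # 25 + 16 + 9 + 9 = 59

def ams_estimate(stream, t):
    """One AMS variable: element at position t, counted from t to the end."""
    n = len(stream)
    el = stream[t]
    val = stream[t:].count(el)
    return n * (2 * val - 1)

# E[n(2c - 1)] equals F2, so averaging over all start positions gives 59 exactly.
avg = sum(ams_estimate(stream, t) for t in range(len(stream))) / len(stream)
```

In the streaming setting one cannot average over all positions; instead a few random positions are sampled and their estimates averaged, trading memory for accuracy.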
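The PageRank function of Chapter 6 is usually computed by power iteration with a teleport term. A minimal sketch; the three-page web, the damping factor 0.85 and the function name are illustrative assumptions, not from the notes.

```python
def pagerank(links, beta=0.85, iters=50):
    """Power iteration: `links` maps each page to its list of out-links."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1 / n for p in pages}                 # start from the uniform vector
    for _ in range(iters):
        new = {p: (1 - beta) / n for p in pages}     # teleport (taxation) share
        for p, outs in links.items():
            for q in outs:                           # p spreads its rank over its out-links
                new[q] += beta * rank[p] / len(outs)
        rank = new
    return rank

# Illustrative web: A -> B, C; B -> C; C -> A.  C collects links from both A and B.
web = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = pagerank(web)
```

Topic-sensitive PageRank, as described in the notes, changes only the teleport step: instead of spreading (1 - beta)/n over all pages, it is spread over the teleport set for the chosen topic.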
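The HITS computation from Chapter 6, the repeated matrix-vector iteration over hub and authority scores, can be sketched without explicit matrices. The tiny web (one hub "H" pointing at two content pages) and the max-normalisation are illustrative assumptions.

```python
def hits(links, iters=50):
    """Alternate authority and hub updates, normalising by the maximum score."""
    pages = list(links)
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iters):
        # authority of p = sum of hub scores of the pages that link to p
        auth = {p: sum(hub[q] for q in pages if p in links[q]) for p in pages}
        norm = max(auth.values())
        auth = {p: v / norm for p, v in auth.items()}
        # hub of p = sum of authority scores of the pages p links to
        hub = {p: sum(auth[q] for q in links[p]) for p in pages}
        norm = max(hub.values())
        hub = {p: v / norm for p, v in hub.items()}
    return hub, auth

# "H" tells you where to find the topic pages; P1 and P2 hold the content.
web = {"H": ["P1", "P2"], "P1": [], "P2": []}
hub, auth = hits(web)
```

The result matches the notes' definitions: H scores as a pure hub, while P1 and P2 score as authorities.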