
2016 IEEE International Conference on Big Data (Big Data)

Knowledge Discovery in Data Science


KDD meets Big Data

Nancy W. Grady
Cyber, Cloud, and Data Science
Science Applications International Corporation
Oak Ridge, TN
[email protected]

Abstract—The Cross-Industry Standard Process model for Data Mining (CRISP-DM) was developed in the late 1990s by a consortium of industry participants to facilitate the end-to-end data mining process for Knowledge Discovery in Databases (KDD). While there have been efforts to better integrate it with management and software development practices, there are no extensions to handle the new activities involved in using big data technologies. Data Science Edge (DSE) is an enhanced process model that accommodates big data technologies and data science activities. In recognition of these changes, the author promotes the use of a new term, Knowledge Discovery in Data Science (KDDS), as a call for the community to develop a new industry-standard data science process model.

Keywords—data science, big data, knowledge discovery, KDD, KDDM, KDDS, data mining, analytics, analytics lifecycle, data lifecycle

I. INTRODUCTION

Experimental design was an early focus of statistics, where data were collected or sampled as necessary and sufficient to definitively answer a hypothesis. This method is still used today, for example, in the pharmaceutical industry, where clinical trials are designed to comprehensively justify the efficacy and safety of a drug for market usage. The best-known process model for statistics is the SAS Institute's Sample, Explore, Modify, Model, Assess (SEMMA) [1][2] model.

While traditional statistics provided a framework for the careful—and typically definitive—analysis of data, organizations began to also look for value in data that were collected for other purposes. Here approximate answers or patterns were sufficient, and the results typically needed to be deployed to generate value. The term "data mining" was coined in 1989 [3] to refer to the application of algorithms for extracting patterns from data, whereas Knowledge Discovery in Data (KDD) – also described as Knowledge Discovery in Data Mining (KDDM) [4] – refers to the broader process [5].

It quickly became clear that some standard guidance was needed for the stages involved in KDD. Several resources provided their own stages for the process (e.g., 9 stages [4], 5 stages [6], 4 or 11 stages [7], and 6 stages [8]) until a standard was established in the late 1990s by the Cross-Industry Standard Process Model for Data Mining (CRISP-DM) [9], as shown in Fig. 1.

CRISP-DM established a 6-stage lifecycle for KDD that has remained the de facto standard for data mining since [10]. While it remains the most commonly used process model, it has a number of shortcomings, including the need for better integration with management processes [11][12], the need to align with software [13][14] and agile development methodologies [15][16], and the need for method guidance for individual activities within stages instead of simple checklists [17]. The history and relationship of many process models have been represented by a provenance diagram [18] as well as through side-by-side comparison [19].

Fig. 1. CRISP-DM process model (Source [9])

Big data has changed the technology landscape for data-intensive applications. The fundamental change is the distribution of data storage across nodes as well as the distribution of analytics to run in parallel against the data on those nodes. While there is some discussion of a 4-stage process model for the Internet of Things (IoT) [20] (which represents the high velocity case for big data), there has as yet been no presentation of the specific changes needed to extend KDD across the new big data technologies. A description of the need for an overarching data science model – along with the needed alignment to other models for software development, operations research, business intelligence, and capability maturity – has been well articulated [21].

This paper presents an overview of the specific changes arising in applications that use big data, to cover the new demands for what could be termed Knowledge Discovery in Data Science (KDDS). The focus here is on analytics in big data system implementations, with the addition of concerns in an industry or government (I&G) context.

This work was supported by SAIC internal Research and Development.


978-1-4673-9005-7/16/$31.00 ©2016 IEEE 1603
II. BIG DATA AND DATA SCIENCE

The paradigm shift known as big data occurred in the mid-2000s with the advent of new techniques for large data file storage (HDFS), physically distributed logical datasets (Hadoop), and parallel processing on the distributed data (MapReduce). The Hadoop ecosystem relies on horizontal scaling to distribute data across independent nodes, with scalability arising from the addition of nodes. The result is the use of parallelism for scalable, data-intensive application development. While it has been given a number of differing conceptual definitions [22], big data is not "bigger data" than common techniques can handle, but rather data that requires parallel processing to meet the time constraints for end-to-end analysis performance at an affordable cost.

A. What We Mean by Big Data

Fundamentally, big data refers to horizontal scaling in dataset storage and in analytics. Big data represents a new paradigm for data-intensive systems, similar to the paradigm shifts that occurred with the advent of relational database management systems or with the shift to massively parallel processing for simulations.

Big Data consists of extensive datasets—primarily in the characteristics of volume, variety, velocity, and/or variability—that require a scalable architecture for efficient storage, manipulation, and analysis [22].

The paradigm shift to use scalable technologies in this context means there are new activities that must be performed in architecting and executing an end-to-end data science project.

B. Distribution of System Components and Roles

The NIST Big Data Reference Architecture (NBDRA) [23] shown in Fig. 2 has the traditional components of a KDD process: the data provider, the application, and the data (or analytics) consumer. The additional roles are in the I&G need for a system orchestrator, and the depiction of the explicit choices among big data frameworks which must be made for processing, platform, and infrastructure capabilities. Like all I&G systems, security and privacy and management are the remaining roles for analytics systems. Analogously to cloud applications, big data systems can involve several different organizations within each of these roles. The additional complexity from this distribution of responsibilities makes the need for a KDDS process model even more important.

C. What We Mean by Data Science

"Data Scientist" is a skill description often assigned to anyone who performs any activity that touches data, including data management, data systems, data analytics, etc. In reality, many specializations are for "vertical" subject matter experts: Data Architects, Big Data Engineers, Data Analysts, Machine Learning experts, and so on. Being a "horizontal" Data Scientist refers to one having expertise in several disciplines, as shown in the NIST definition of Data Scientist.

A Data Scientist is a practitioner who has sufficient knowledge in the overlapping regimes of business needs, domain knowledge, analytical skills, and software and systems engineering to manage the end-to-end data processes in the data life cycle [22].

The NIST definition clarifies that the process activities for the design and execution of data science for BDA must involve domain (data and process) experts; statistics, data mining, and machine learning experts; software and systems engineers; and individuals who understand the mission for efficient execution. A Venn diagram [24] was suggested to represent the overlap of required skills for data science. Fig. 3 shows the overlapping skills needed for big data systems. While no single person may be skilled in all these areas, a team must encompass all these needed skills, with each member familiar enough with the other disciplines to communicate across the team.

The term scientist is used since the activities are similar to those of the other sciences: hypothesis generation, design of the experimental process and (computational) equipment, parallelization of the data and the algorithms, evaluation of methods and validation of their results, deployment of the system, and the communication of findings.

Fig. 2. NIST Big Data Reference Architecture (NBDRA)

Fig. 3. Mix of Skills for Data Scientists

1 This diagram is an extension of the diagram in [22], which was contributed by this author to that NIST document.
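The horizontal scaling described in Section II (distributing records across independent nodes so the same analytic can run in parallel on each partition) can be sketched with a minimal hash partitioner. The node count, record layout, and function names below are hypothetical illustrations, not part of the paper's model:

```python
import hashlib

NUM_NODES = 4  # hypothetical cluster size


def node_for(key: str) -> int:
    """Deterministically hash a record key to a node index."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % NUM_NODES


# Distribute records across nodes; each node can then run the
# same analytic over its local partition in parallel.
records = [{"id": f"rec-{i}", "value": i} for i in range(10)]
partitions = {n: [] for n in range(NUM_NODES)}
for rec in records:
    partitions[node_for(rec["id"])].append(rec)

# Every record lands on exactly one node.
assert sum(len(p) for p in partitions.values()) == len(records)
```

Note that the distribution key must be chosen with some knowledge of the data's characteristics, or the partitions (and hence the parallel analytics) will be poorly load-balanced, a point the Collect stage discussion returns to below.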

III. KDDS PHASES

DSE, as shown in Fig. 4, is an end-to-end process model from mission needs planning to the delivery of value. I&G systems have greater complexity in that the project lifecycle is typically executed in four distinct phases.

Fig. 4. Data Science Edge phases, activities, and data maturity

A. Assess

The first phase is a planning, analysis of alternatives, and rough order of magnitude estimation process. Will the results of the proposed system be actionable by the organization? Does it represent a return on investment? Are there regulatory or policy issues that would hinder any of the alternatives or the ability to obtain the needed data? Will the activities and results be acceptable to all the organization's stakeholders? Will the organization have the staff skills to operate the system?

For I&G projects this phase typically occurs prior to the decision to fund the project and to commit resources. It includes the requirements and the definition of any critical success factors.

B. Architect

The second phase consists of a translation of the requirements into a solution for a new system, or to address the gaps from the current state to the final state that will satisfy the requirements for an enhanced system. The solution may implicitly or explicitly include an analysis of the alternatives for the final solution components. The solution must consider the operational concerns for the deployment of the system, much as devops is a recognition in software development of the need to include deployment concerns early in the process.

C. Build

The third phase consists of the development, test, and deployment of the technical solution. This stage differs from the normal software development lifecycle in that issues with the data may cause additional unexpected cleansing and transformation activities, or the system may function but deliver results with insufficient accuracy, requiring additional modeling or a return to design to choose new models.

D. Improve

The final phase consists of the operation and management of the system as well as an analysis of innovative ways the system performance could be improved. This process centers on the gaps in the system performance (speed, accuracy, etc.). These gaps can be considered new mission needs, for a return to the Assess phase to begin again.

IV. KDDS PROCESS MODEL

DSE is a continually evolving process model encapsulating lessons learned over years of customer engagements. At a high level DSE appears to differ only slightly from prior process models by organizing around the state of the data. CRISP-DM assumed a process model where the data was staged for Data Understanding and stored after Data Preparation, based on the traditional data warehouse model. Since the storage of the data is ordered in the process according to the "V" characteristics of the data – such as volume (storing the data only once in its original raw state) or velocity (not storing it at all) – it is best to group the activities into stages based upon the maturity of the data and the types of processing associated with getting the data to reach that stage of maturity. In the case of large volume systems, the curated data may not be stored. Instead the curation is performed dynamically on the raw data as the data is pulled for the analytics. Just as in CRISP-DM, each stage can result in a return to a previous stage as a result of the evaluation at that stage.

DSE consists of five process stages: 1) Plan, 2) Collect, 3) Curate, 4) Analyze, and 5) Act. Each stage results in the generation and potential storage and usage of data at increasing levels of maturity: 1) goals or solution architecture and requirements, 2) original data as collected, 3) contextualized and organized information, 4) synthesized and useful knowledge, and 5) measurable value to the organization. The three distinctive types of visualization are also represented, for 1) the exploration of potentially large raw datasets, 2) the visualization for the evaluation of analytics results, and 3) the streamlined explanation of information so that a decision-maker can grasp the critical results quickly and accurately. Each of these three types of visualization requires different techniques and practitioner skills.

V. NEW KDDS ACTIVITIES

A quick comparison shows that the CRISP-DM and DSE stages are well aligned, but the activities within each stage have been significantly reordered within and between stages, and a number of new activities have been added. While DSE incorporates management, requirements generation, and software development processes, only the additional activities related to big data concerns are presented in this paper. The new activities are described within their stages.

A. Plan Stage

1) Defining organizational boundaries

Analogously to cloud architectures, there can be multiple organizations involved. For example, the physical infrastructure can be from an external cloud service provider, while the system can leverage a platform-as-a-service or even an analytics-as-a-service vendor hosting service. In another example, data obtained from an external organization may not be cleared for ingestion into another party's system. This would require additional tasks not only to coordinate efforts with other organizations but also to accommodate any new contractual constraints.
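The five-stage progression of Section IV, with its returns to earlier stages when an evaluation fails, can be sketched as a toy control loop. The stage names and maturity comments follow the paper; the gate-check logic, function names, and "rework" behavior are purely hypothetical illustrations:

```python
from enum import Enum


class Stage(Enum):
    PLAN = 1     # goals, solution architecture, requirements
    COLLECT = 2  # original data as collected
    CURATE = 3   # contextualized and organized information
    ANALYZE = 4  # synthesized and useful knowledge
    ACT = 5      # measurable value to the organization


def evaluate(stage: Stage, artifacts: dict) -> bool:
    """Hypothetical gate check; a failed evaluation sends the
    project back to an earlier stage, as in CRISP-DM."""
    return artifacts.get(stage.name, False)


def run_pipeline(artifacts: dict) -> list:
    history = []
    stages = list(Stage)
    i = 0
    while i < len(stages):
        stage = stages[i]
        history.append(stage.name)
        if evaluate(stage, artifacts):
            i += 1
        else:
            artifacts[stage.name] = True  # pretend rework fixes it
            i = max(i - 1, 0)             # revisit the prior stage
    return history


# A project whose Curate gate fails once revisits Collect, then
# proceeds: PLAN, COLLECT, CURATE, COLLECT, CURATE, ANALYZE, ACT.
trace = run_pipeline({"PLAN": True, "COLLECT": True,
                      "CURATE": False, "ANALYZE": True, "ACT": True})
```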

2) Justifying the investment

Many of our customers have existing systems and staff members with experience using existing technologies for systems development. Big data technologies can represent a revolutionary change to standard business practices and policies. As these efforts often leverage open source tools and technologies, they can raise security concerns for some organizations. Organizations also must determine if the expense in tools, training, and time is justified based on an estimate of the additional value to be generated. This adds return-on-investment estimation and organization readiness tasks prior to the generation of recommended goals.

3) Policy and governance

While CRISP-DM was generated by a consortium of industrial participants, there was little attention given to the concerns of regulation and policy – concerns that are critical to our government customers. For example, there are cases where additional external data would be extremely valuable to contextualize the data for an organization, but data sharing regulations would not allow their inclusion.

B. Collect Stage

1) Databases

Traditionally it was assumed that collected data would be staged, then, after an ETL process, subsequently stored in a relational database. With the advent of NoSQL technologies as well as the newly named NewSQL technologies for volume big data systems, there are a number of choices to consider prior to the collection of data. In the case of high velocity systems, the choice revolves around the design of in-memory data management.

2) Data distribution

The heart of volume big data is the distribution of data across independent nodes. Something must be known about the data's characteristics in order to distribute that data across the independent nodes to properly load balance the analytics that will be run on each node.

3) Dynamic retrieval

Given the involvement of multiple organizations, it is sometimes not possible to ingest a dataset. Rather, the data remains with the data provider, and either query access is negotiated or code is sent to perform the curation and analytics on the data provider's resources.

C. Curate Stage

1) Exploratory Visualization

Visualization is typically discussed as one monolithic endeavor – the visual presentation of data for improved human understanding. There are, in fact, three distinct visualization activities that involve very different tools and skillsets. Exploratory Visualization is the attempt to browse the collected data to gain insights for cleansing or for the determination of the distribution across nodes. Prior techniques had focused on datasets with large numbers of attributes, rather than large numbers of records. Given the immensity of the data, new visual approaches may be needed to understand the quality, statistical distribution, and potential features in very large datasets where summarization and full-scan statistics are not possible. Explicatory Visualization is used to evaluate both curation and analysis activities.

2) Privacy and data fusion

There are a number of laws, such as the Health Insurance Portability and Accountability Act (HIPAA), that apply to industries in specific domains, in this case for personal health records. There is an emergent behavior known as the Mosaic Effect, where datasets that do not independently represent privacy concerns can create personally identifiable information (PII) when fused [25]. This has become increasingly important in big data analytics. This problem is not new to big data, but it is clearly exacerbated by large volumes of social media information (e.g., determining your friends, school, and employment history), smartphone location data (which can easily infer where you work and live), and health information from the correlation of DNA or electronic health records against a pool of "anonymized information". Additional care must be taken to analyze the expected results of the fusion of datasets to ensure there will be no unintended consequences from data fusion, including the generation of PII as well as potential changes in the security classification of the data.

3) Distributed Data Repositories

Gartner uses the term Logical Data Warehouse (LDW) to indicate the treatment of multiple distributed repositories as a single entity. Consideration must be given to the performance of an LDW approach versus the migration of the data to a single repository for curation and analysis. This is most common in the variety case of big data, since the differences in datasets may dictate placing them in separate logical models.

4) Metadata

For many years domain expertise has been required when performing data mining on re-purposed data (i.e., data that was created for another purpose). For repurposed data to be machine actionable, sufficient metadata must be available to ensure the proper use of the data elements. Proper curation may not be possible without semantic metadata about the elements and clear provenance or history on the entire dataset.

5) Data quality

For data warehouses, the Curate stage includes a set of activities designed to cleanse the data and potentially remove outliers that would adversely affect analysis. In some volume circumstances the errors may become insignificant for actions only requiring aggregated data. Thus additional consideration must be given to what extent the data quality concerns will affect the analytics and the actions that will result based on deployment. In addition, data quality becomes a concern again in the specific techniques that can be used for very large or high velocity datasets.
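The Mosaic Effect discussed in the Curate stage can be demonstrated with two toy datasets that are individually innocuous but identifying when fused on quasi-identifiers. All names and values below are fabricated for illustration only:

```python
# Dataset A: "anonymized" health records (no names or IDs).
health = [
    {"zip": "37830", "birth_year": 1970, "diagnosis": "diabetes"},
    {"zip": "37831", "birth_year": 1985, "diagnosis": "asthma"},
]

# Dataset B: a public roster (no health data).
roster = [
    {"name": "J. Doe", "zip": "37830", "birth_year": 1970},
    {"name": "A. Smith", "zip": "37902", "birth_year": 1990},
]

# Fusing on the quasi-identifiers (zip, birth_year) re-creates PII:
fused = [
    {**person, "diagnosis": record["diagnosis"]}
    for person in roster
    for record in health
    if (person["zip"], person["birth_year"])
       == (record["zip"], record["birth_year"])
]
# 'fused' now links a named individual to a diagnosis that
# neither dataset exposed on its own.
```

In a real curation activity the analogous check runs in the other direction: before approving a fusion, estimate whether the combined quasi-identifiers uniquely identify individuals.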

6) Sampling

Traditionally, careful sampling was used to make analytics tractable – solvable in the allowed time window. Big data systems can present some challenges: 1) the over-arching data distribution may not be known, making stratified sampling difficult, and 2) full-scan statistics are sometimes not possible. The scalability of NoSQL platforms can make sampling potentially less critical, so sampling must now be considered to determine to what extent it is required, or whether it is required at all.
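One single-pass technique that sidesteps both challenges above (an unknown distribution and the impossibility of a full scan) is reservoir sampling, which maintains a uniform random sample of a stream of unknown length in O(k) memory. This sketch is illustrative and not drawn from the paper:

```python
import random


def reservoir_sample(stream, k, seed=42):
    """Keep a uniform random sample of k items from a stream of
    unknown length, in one pass and O(k) memory."""
    rng = random.Random(seed)
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)
        else:
            # Item i survives with probability k / (i + 1).
            j = rng.randint(0, i)
            if j < k:
                reservoir[j] = item
    return reservoir


# Sample 10 records from a "stream" too large to hold or rescan.
sample = reservoir_sample(range(1_000_000), 10)
```

In a distributed setting, each node can keep its own reservoir over its local partition, with the partial reservoirs merged (weighted by partition size) at the end, which fits the horizontally scaled architecture described earlier.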
Fig. 5. Distributed Asymmetric Analytics

D. Analyze Stage

1) Correlation versus causation

Previously, in the world of statistics, data were carefully collected and/or sampled to obtain a precise result. When the data mining community came together, they determined that an approximate model could provide the information needed. Since then, the expectations for precision in data science have changed, and under certain circumstances even merely identifying a trend is now sufficient.

2) Hypothesis generation

A major concern for data miners has always been the expectation of non-data miners that simply taking a dataset and exploring it can reveal something useful. In the context of agile data science, the expectation is that the first iteration is an early exploration to generate a hypothesis and that the second iteration will work to confirm or deny that hypothesis.

3) Simpler questions

In some cases an initial estimation of the needed technical data processing and analytics turns out to be impractical for implementation. Any analytics effort should reassess the technical approach to determine if there are simpler ways to analyze surrogate questions that could be answered and still meet the mission requirements.

4) Distributed algorithms

Initial BDA focused on MapReduce, where the analytics were fairly simple (such as a query or aggregation). Most machine learning algorithms are not, however, so "embarrassingly parallel" that the same operation can be run independently across each data subset. The choice of algorithm may be affected by the ability to parallelize it to run at scale.

5) Concurrency

NoSQL databases do not follow the same principles as relational databases, so data updates must propagate through a cluster. The possibility that the data may not have been updated on all nodes may lead to inconsistent analytics results, so concurrency must be explicitly considered.

6) Latency

Horizontal scaling also introduces latency concerns if the analytics on each node are not, in fact, independent and one process is waiting on another to finish. This concern is analogous to the message-passing concerns for compute-intensive applications when they are not "embarrassingly parallel" computations.

7) Distributed asymmetric analytics

There is another case where the analytics are not embarrassingly parallel. With the IoT there is a tradeoff between data transmission from a sensor and the bandwidth limitations for data transmission to a host, as well as the differences in the compute power of differing devices, as illustrated in Fig. 5. Given the limitations on communication bandwidth, analytics must be planned for each type of device to reduce the detrimental impact of the communication requirements and meet the end-to-end mission goals.

E. Act Stage

While a discussion of how the NBDRA Data Consumer might leverage analytic results offered by other systems is outside the scope of this KDDS process model, there are two activities to emphasize.

1) Explanatory Visualization

With the increasing complexity of datasets and analytics, attention must be given to how the results are to be communicated (e.g., the optimal way to communicate the complexity of the analysis, the uncertainty in the process stages, and the precision of the results). Explanatory visualization – also termed information visualization – has become even more critical to be able to communicate probabilities and uncertainties to often non-technical decision-makers. A new term for getting data down to a manageable amount is "small data", and the techniques to communicate the results of small data analysis include the creation of infographics or interactive dashboards. This activity relates to human factors as well as graphic design.

2) Implicit security and classification

Security is not a new concern for I&G customers, but the potential distribution of different system aspects across multiple organizations has opened up additional activities necessary to ensure the protection of big data systems. In addition, for our government, defense, and intelligence customers, classified big data systems have further requirements on the exchange of data between organizations or system components running at different classification levels. This again makes it clear that analytics applications and the choice of processing, platform, and infrastructure resources cannot be made outside the context of the expected deployment of those resources.
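The contrast drawn in the Analyze stage between embarrassingly parallel analytics and those requiring coordination can be illustrated with a MapReduce-style word count: the map step runs independently on each node's partition, and merging the partial counts is the only coordination required. The partitions and data here are illustrative stand-ins for distributed storage:

```python
from collections import Counter
from functools import reduce


def map_partition(lines):
    """Runs independently on each node's local partition; no
    communication with other nodes is needed (embarrassingly
    parallel)."""
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    return counts


def reduce_counts(a, b):
    """Merging partial results is the only coordination step."""
    return a + b


# Two hypothetical node-local partitions of a text dataset.
partitions = [
    ["big data big", "data science"],
    ["big data systems"],
]

partials = [map_partition(p) for p in partitions]  # parallelizable
total = reduce(reduce_counts, partials, Counter())
```

An algorithm that needs global state at every step, such as an iterative gradient update over all records, breaks this pattern: each iteration forces a synchronization like the final reduce above, which is exactly the latency and parallelizability concern the section describes.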

VI. CONCLUSIONS

This paper has discussed at a conceptual level the many activities that have been added to our KDDS process model that we use in our customer engagements. The focus has been on the new activities needed for the design and development of big data systems. Some tasks naturally overlap with other disciplines (such as mission portfolio analysis, business process re-engineering, software and systems development, business analytics, project management, and data-driven decision-making), but the tasks discussed here directly relate to the efficient and accurate delivery of value from data science and applications involving big data.

The breadth of new activities in DSE over CRISP-DM establishes that there are significant differences between the execution of data science over scalable big data systems and the tasks presented in prior KDD process models.

VII. FUTURE DIRECTIONS

DSE continues to evolve from lessons learned in each engagement with our customers, and from alignment with other evolving mission, software, and management process models.

Given the massive changes in the techniques and technologies of big data, this paper also serves as a call for the community to begin the work necessary to create a standard process methodology for KDDS, which perhaps could be named the Cross-Industry Standard Process Model for Data Science (CRISP-DS).

ACKNOWLEDGMENT

The author thanks the NIST Big Data Public Working Group for discussions of advancements in big data; colleagues at SAIC for the development and deployed usage of the DSE process model across our customer base; and colleagues Mini Kanwal, Sanjay Sardar, and Prachi Sukhatankar for specific review and suggestions.

REFERENCES

[1] A. Azevedo and M. F. Santos, "KDD, SEMMA and CRISP-DM," presented at the IADIS European Conference on Data Mining, 2008, pp. 182–185.
[2] SAS Institute, "SAS Enterprise Miner SEMMA".
[3] G. Piatetsky-Shapiro, "Knowledge Discovery in Real Databases: A Report on the IJCAI-89 Workshop," AI Magazine, vol. 11, no. 5, pp. 1–3, Feb. 1990.
[4] K.-M. Osei-Bryson and C. Barclay, "Introduction," in Knowledge Discovery Process and Methods, K.-M. Osei-Bryson and C. Barclay, Eds., 2015.
[5] U. M. Fayyad, G. Piatetsky-Shapiro, and P. Smyth, "From Data Mining to Knowledge Discovery: An Overview," in Advances in Knowledge Discovery and Data Mining, U. M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy, Eds., AAAI/MIT Press, 1996.
[6] P. Cabena et al., Discovering Data Mining: From Concepts to Implementation, Prentice Hall, Upper Saddle River, NJ, 1997.
[7] M. Berry and G. Linoff, Data Mining Techniques for Marketing, Sales and Customer Support, John Wiley & Sons, New York, 1997.
[8] K. Cios and L. Kurgan, "Trends in Data Mining and Knowledge Discovery," in Advances in Knowledge Discovery and Data Mining, U. M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy, Eds., AAAI/MIT Press, 1996.
[9] P. Chapman, J. Clinton, R. Kerber, T. Khabaza, T. Reinartz, C. Shearer, and R. Wirth, CRISP-DM 1.0: Step-by-step data mining guide. The CRISP-DM Consortium, 2000.
[10] G. Piatetsky-Shapiro, "CRISP-DM, still the top methodology for analytics, data mining, or data science projects," http://www.kdnuggets.com/2014/10/crisp-dm-top-methodology-analytics-data-mining-data-science-projects.html, 2014, accessed July 26, 2016.
[11] S. Sharma and K. Osei-Bryson, "An Integrated Knowledge Discovery and Data Mining Process Model," in Knowledge Discovery Process and Methods, K.-M. Osei-Bryson and C. Barclay, Eds., 2015.
[12] S. Sharma and K. Osei-Bryson, "A Novel Method for Formulating the Business Objectives of Data Mining Projects," in Knowledge Discovery Process and Methods, K.-M. Osei-Bryson and C. Barclay, Eds., 2015.
[13] M. Alnoukari, Z. Alzoabi, and S. Hanna, "Applying adaptive software development (ASD) agile modeling on predictive data mining applications: ASD-DM methodology," 2008 International Symposium on Information Technology, Kuala Lumpur, Malaysia, 2008, pp. 1–6.
[14] Ó. Marbán, J. Segovia, E. Menasalvas, and C. Fernández-Baizán, "Toward data mining engineering: A software engineering approach," Information Systems, vol. 34, no. 1, pp. 87–107, Mar. 2009.
[15] G. Nascimento and A. Oliveira, "AgileKDD," presented at the Second International Conference on Advances in Information Mining and Management, 2012, pp. 118–122.
[16] E. Ridge, Guerrilla Analytics: A Practical Approach to Working with Data, Morgan Kaufmann, 2014.
[17] M. Charest, S. Delisle, O. Cervantes, and Y. Shen, "Invited Paper: Intelligent Data Mining Assistance via CBR and Ontologies," 17th International Workshop on Database and Expert Systems Applications (DEXA'06), Krakow, 2006, pp. 593–597.
[18] G. Mariscal, Ó. Marbán, and C. Fernández, "A survey of data mining and knowledge discovery process models and methodologies," The Knowledge Engineering Review, vol. 25, no. 2, pp. 137–166, Jun. 2010.
[19] L. A. Kurgan and P. Musilek, "A survey of Knowledge Discovery and Data Mining process models," The Knowledge Engineering Review, vol. 21, no. 1, pp. 1–24, Jul. 2006.
[20] A. Jaokar, "A methodology for solving problems with DataScience for Internet of Things," http://www.datasciencecentral.com/profiles/blog/show?id=6448529%3ABlogPost%3A450284, posted July 21, 2016, and http://www.datasciencecentral.com/profiles/blogs/a-methodology-for-solving-problems-with-datascience-for-interne-1, posted July 25, 2016, accessed July 28, 2016.
[21] J. Saltz, "The Need for New Processes, Methodologies and Tools to Support Big Data Teams and Improve Big Data Project Effectiveness," First Annual Workshop on Methodologies and Tools to Improve Big Data Projects, at the 2015 IEEE International Conference on Big Data, Santa Clara, CA, 2015.
[22] NIST Big Data Public Working Group, "NIST Big Data Interoperability Framework: Volume 1, Definitions," N. Grady and W. Chang, Eds., NIST Special Publication 1500-1, Sep. 2015.
[23] NIST Big Data Public Working Group, "NIST Big Data Interoperability Framework: Volume 6, Reference Architecture," O. Levin, D. Boyd, and W. Chang, Eds., NIST Special Publication 1500-6, Sep. 2015.
[24] D. Conway, "The Data Science Venn Diagram," http://drewconway.com/zia/2013/3/26/the-data-science-venn-diagram, published March 2013, accessed July 28, 2016.
[25] A. Mazmanian, "The mosaic effect and big data," https://fcw.com/Articles/2014/05/13/fose-mosaic.aspx, published April 2014, accessed July 28, 2016.

