Data Science Foundations Geometry and Topology of Complex Hierarchic Systems and Big Data Analytics 1st Edition Fionn Murtagh

Chapman & Hall/CRC
Computer Science and Data Analysis Series

The interface between the computer and statistical sciences is increasing, as each
discipline seeks to harness the power and resources of the other. This series aims to
foster the integration between the computer sciences and statistical, numerical, and
probabilistic methods by publishing a broad range of reference works, textbooks, and
handbooks.

SERIES EDITORS
David Blei, Princeton University
David Madigan, Rutgers University
Marina Meila, University of Washington
Fionn Murtagh, Royal Holloway, University of London

Proposals for the series should be sent directly to one of the series editors above, or submitted to:

Chapman & Hall/CRC
Taylor and Francis Group
3 Park Square, Milton Park
Abingdon, OX14 4RN, UK

Published Titles

Semisupervised Learning for Computational Linguistics
Steven Abney

Visualization and Verbalization of Data
Jörg Blasius and Michael Greenacre

Design and Modeling for Computer Experiments
Kai-Tai Fang, Runze Li, and Agus Sudjianto

Microarray Image Analysis: An Algorithmic Approach
Karl Fraser, Zidong Wang, and Xiaohui Liu

R Programming for Bioinformatics
Robert Gentleman

Exploratory Multivariate Analysis by Example Using R
François Husson, Sébastien Lê, and Jérôme Pagès

Bayesian Artificial Intelligence, Second Edition
Kevin B. Korb and Ann E. Nicholson

Computational Statistics Handbook with MATLAB®, Third Edition
Wendy L. Martinez and Angel R. Martinez

Exploratory Data Analysis with MATLAB®, Third Edition
Wendy L. Martinez, Angel R. Martinez, and Jeffrey L. Solka

Statistics in MATLAB®: A Primer
Wendy L. Martinez and MoonJung Cho

Clustering for Data Mining: A Data Recovery Approach, Second Edition
Boris Mirkin

Introduction to Machine Learning and Bioinformatics
Sushmita Mitra, Sujay Datta, Theodore Perkins, and George Michailidis

Introduction to Data Technologies
Paul Murrell

R Graphics
Paul Murrell

Correspondence Analysis and Data Coding with Java and R
Fionn Murtagh

Data Science Foundations: Geometry and Topology of Complex Hierarchic Systems and Big Data Analytics
Fionn Murtagh

Pattern Recognition Algorithms for Data Mining
Sankar K. Pal and Pabitra Mitra

Statistical Computing with R
Maria L. Rizzo

Statistical Learning and Data Science
Mireille Gettler Summa, Léon Bottou, Bernard Goldfarb, Fionn Murtagh,
Catherine Pardoux, and Myriam Touati

Music Data Analysis: Foundations and Applications
Claus Weihs, Dietmar Jannach, Igor Vatolkin, and Günter Rudolph

Foundations of Statistical Algorithms: With References to R Packages
Claus Weihs, Olaf Mersmann, and Uwe Ligges

Chapman & Hall/CRC
Computer Science and Data Analysis Series

DATA SCIENCE
FOUNDATIONS
Geometry and Topology
of Complex Hierarchic Systems
and Big Data Analytics

Fionn Murtagh

Boca Raton London New York

CRC Press is an imprint of the Taylor & Francis Group, an Informa business
A CHAPMAN & HALL BOOK
CRC Press
Taylor & Francis Group
6000 Broken Sound Parkway NW, Suite 300
Boca Raton, FL 33487-2742

© 2018 by Taylor & Francis Group, LLC


CRC Press is an imprint of Taylor & Francis Group, an Informa business

No claim to original U.S. Government works

Printed on acid-free paper


Version Date: 20170823

International Standard Book Number-13: 978-1-4987-6393-6 (Hardback)

This book contains information obtained from authentic and highly regarded sources. Reasonable efforts have been
made to publish reliable data and information, but the author and publisher cannot assume responsibility for the validity
of all materials or the consequences of their use. The authors and publishers have attempted to trace the copyright
holders of all material reproduced in this publication and apologize to copyright holders if permission to publish in this
form has not been obtained. If any copyright material has not been acknowledged please write and let us know so we may
rectify in any future reprint.

Except as permitted under U.S. Copyright Law, no part of this book may be reprinted, reproduced, transmitted, or utilized
in any form by any electronic, mechanical, or other means, now known or hereafter invented, including photocopying,
microfilming, and recording, or in any information storage or retrieval system, without written permission from the
publishers.

For permission to photocopy or use material electronically from this work, please access www.copyright.com
(http://www.copyright.com/) or contact the Copyright Clearance Center, Inc. (CCC), 222 Rosewood Drive, Danvers,
MA 01923, 978-750-8400. CCC is a not-for-profit organization that provides licenses and registration for a variety of
users. For organizations that have been granted a photocopy license by the CCC, a separate system of payment has been
arranged.

Trademark Notice: Product or corporate names may be trademarks or registered trademarks, and are used only for
identification and explanation without intent to infringe.
Visit the Taylor & Francis Web site at
http://www.taylorandfrancis.com
and the CRC Press Web site at
http://www.crcpress.com
Contents

Preface

I Narratives from Film and Literature, from Social Media and Contemporary Life

1 The Correspondence Analysis Platform for Mapping Semantics

1.1 The Visualization and Verbalization of Data
1.2 Analysis of Narrative from Film and Drama
1.2.1 Introduction
1.2.2 The Changing Nature of Movie and Drama
1.2.3 Correspondence Analysis as a Semantic Analysis Platform
1.2.4 Casablanca Narrative: Illustrative Analysis
1.2.5 Modelling Semantics via the Geometry and Topology of Information
1.2.6 Casablanca Narrative: Illustrative Analysis Continued
1.2.7 Platform for Analysis of Semantics
1.2.8 Deeper Look at Semantics of Casablanca: Text Mining
1.2.9 Analysis of a Pivotal Scene
1.3 Application of Narrative Analysis to Science and Engineering Research
1.3.1 Assessing Coverage and Completeness
1.3.2 Change over Time
1.3.3 Conclusion on the Policy Case Studies
1.4 Human Resources Multivariate Performance Grading
1.5 Data Analytics as the Narrative of the Analysis Processing
1.6 Annex: The Correspondence Analysis and Hierarchical Clustering Platform
1.6.1 Analysis Chain
1.6.2 Correspondence Analysis: Mapping χ² Distances into Euclidean Distances
1.6.3 Input: Cloud of Points Endowed with the Chi-Squared Metric
1.6.4 Output: Cloud of Points Endowed with the Euclidean Metric in Factor Space
1.6.5 Supplementary Elements: Information Space Fusion
1.6.6 Hierarchical Clustering: Sequence-Constrained

2 Analysis and Synthesis of Narrative: Semantics of Interactivity

2.1 Impact and Effect in Narrative: A Shock Occurrence in Social Media
2.1.1 Analysis
2.1.2 Two Critical Tweets in Terms of Their Words
2.1.3 Two Critical Tweets in Terms of Twitter Sub-narratives
2.2 Analysis and Synthesis, Episodization and Narrativization
2.3 Storytelling as Narrative Synthesis and Generation
2.4 Machine Learning and Data Mining in Film Script Analysis
2.5 Style Analytics: Statistical Significance of Style Features
2.6 Typicality and Atypicality for Narrative Summarization and Transcoding
2.7 Integration and Assembling of Narrative

II Foundations of Analytics through the Geometry and Topology of Complex Systems

3 Symmetry in Data Mining and Analysis through Hierarchy

3.1 Analytics as the Discovery of Hierarchical Symmetries in Data
3.2 Introduction to Hierarchical Clustering, p-Adic and m-Adic Numbers
3.2.1 Structure in Observed or Measured Data
3.2.2 Brief Look Again at Hierarchical Clustering
3.2.3 Brief Introduction to p-Adic Numbers
3.2.4 Brief Discussion of p-Adic and m-Adic Numbers
3.3 Ultrametric Topology
3.3.1 Ultrametric Space for Representing Hierarchy
3.3.2 Geometrical Properties of Ultrametric Spaces
3.3.3 Ultrametric Matrices and Their Properties
3.3.4 Clustering through Matrix Row and Column Permutation
3.3.5 Other Data Symmetries
3.4 Generalized Ultrametric and Formal Concept Analysis
3.4.1 Link with Formal Concept Analysis
3.4.2 Applications of Generalized Ultrametrics
3.5 Hierarchy in a p-Adic Number System
3.5.1 p-Adic Encoding of a Dendrogram
3.5.2 p-Adic Distance on a Dendrogram
3.5.3 Scale-Related Symmetry
3.6 Tree Symmetries through the Wreath Product Group
3.6.1 Wreath Product Group for Hierarchical Clustering
3.6.2 Wreath Product Invariance
3.6.3 Wreath Product Invariance: Haar Wavelet Transform of Dendrogram
3.7 Tree and Data Stream Symmetries from Permutation Groups
3.7.1 Permutation Representation of a Data Stream
3.7.2 Permutation Representation of a Hierarchy
3.8 Remarkable Symmetries in Very High-Dimensional Spaces
3.9 Short Commentary on This Chapter

4 Geometry and Topology of Data Analysis: in p-Adic Terms

4.1 Numbers and Their Representations
4.1.1 Series Representations of Numbers
4.1.2 Field
4.2 p-Adic Valuation, p-Adic Absolute Value, p-Adic Norm
4.3 p-Adic Numbers as Series Expansions
4.4 Canonical p-Adic Expansion; p-Adic Integer or Unit Ball
4.5 Non-Archimedean Norms as p-Adic Integer Norms in the Unit Ball
4.5.1 Archimedean and Non-Archimedean Absolute Value Properties
4.5.2 A Non-Archimedean Absolute Value, or Norm, is Less Than or Equal to One, and an Archimedean Absolute Value, or Norm, is Unbounded
4.6 Going Further: Negative p-Adic Numbers, and p-Adic Fractions
4.7 Number Systems in the Physical and Natural Sciences
4.8 p-Adic Numbers in Computational Biology and Computer Hardware
4.9 Measurement Requires a Norm, Implying Distance and Topology
4.10 Ultrametric Topology
4.11 Short Review of p-Adic Cosmology
4.12 Unbounded Increase in Mass or Other Measured Quantity
4.13 Scale-Free Partial Order or Hierarchical Systems
4.14 p-Adic Indexing of the Sphere
4.15 Diffusion and Other Dynamic Processes in Ultrametric Spaces

III New Challenges and New Solutions for Information Search and Discovery

5 Fast, Linear Time, m-Adic Hierarchical Clustering

5.1 Pervasive Ultrametricity: Computational Consequences
5.1.1 Ultrametrics in Data Analytics
5.1.2 Quantifying Ultrametricity
5.1.3 Pervasive Ultrametricity
5.1.4 Computational Implications
5.2 Applications in Search and Discovery using the Baire Metric
5.2.1 Baire Metric
5.2.2 Large Numbers of Observables
5.2.3 High-Dimensional Data
5.2.4 First Approach Based on Reduced Precision of Measurement
5.2.5 Random Projections in High-Dimensional Spaces, Followed by the Baire Distance
5.2.6 Summary Comments on Search and Discovery
5.3 m-Adic Hierarchy and Construction
5.4 The Baire Metric, the Baire Ultrametric
5.4.1 Metric and Ultrametric Spaces
5.4.2 Ultrametric Baire Space and Distance
5.5 Multidimensional Use of the Baire Metric through Random Projections
5.6 Hierarchical Tree Defined from m-Adic Encoding
5.7 Longest Common Prefix and Hashing
5.7.1 From Random Projection to Hashing
5.8 Enhancing Ultrametricity through Precision of Measurement
5.8.1 Quantifying Ultrametricity
5.8.2 Pervasiveness of Ultrametricity
5.9 Generalized Ultrametric and Formal Concept Analysis
5.9.1 Generalized Ultrametric
5.9.2 Formal Concept Analysis
5.10 Linear Time and Direct Reading Hierarchical Clustering
5.10.1 Linear Time, or O(N) Computational Complexity, Hierarchical Clustering
5.10.2 Grid-Based Clustering Algorithms
5.11 Summary: Many Viewpoints, Various Implementations

6 Big Data Scaling through Metric Mapping

6.1 Mean Random Projection, Marginal Sum, Seriation
6.1.1 Mean of Random Projections as A Seriation
6.1.2 Normalization of the Random Projections
6.2 Ultrametric and Ordering of Rows, Columns
6.3 Power Iteration Clustering
6.4 Input Data for Eigenreduction
6.4.1 Implementation: Equivalence of Iterative Approximation and Batch Calculation
6.5 Inducing a Hierarchical Clustering from Seriation
6.6 Short Summary of All These Methodological Underpinnings
6.6.1 Trivial First Eigenvalue, Eigenvector in Correspondence Analysis
6.7 Very High-Dimensional Data Spaces: Data Piling
6.8 Recap on Correspondence Analysis for Following Applications
6.8.1 Clouds of Points, Masses and Inertia
6.8.2 Relative and Absolute Contributions
6.9 Evaluation 1: Uniformly Distributed Data Cloud Points
6.9.1 Computation Time Requirements
6.10 Evaluation 2: Time Series of Financial Futures
6.11 Evaluation 3: Chemistry Data, Power Law Distributed
6.11.1 Data and Determining Power Law Properties
6.11.2 Randomly Generating Power Law Distributed Data in Varying Embedding Dimensions
6.12 Application 1: Quantifying Effectiveness through Aggregate Outcome
6.12.1 Computational Requirements, from Original Space and Factor Space Identities
6.13 Application 2: Data Piling as Seriation of Dual Space
6.14 Brief Concluding Summary
6.15 Annex: R Software Used in Simulations and Evaluations
6.15.1 Evaluation 1: Dense, Uniformly Distributed Data
6.15.2 Evaluation 2: Financial Futures
6.15.3 Evaluation 3: Chemicals of Specified Marginal Distribution

IV New Frontiers: New Vistas on Information, Cognition and the Human Mind

7 On Ultrametric Algorithmic Information

7.1 Introduction to Information Measures
7.2 Wavelet Transform of a Set of Points Endowed with an Ultrametric
7.3 An Object as a Chain of Successively Finer Approximations
7.3.1 Approximation Chain using a Hierarchy
7.3.2 Dendrogram Wavelet Transform of Spherically Complete Space
7.4 Generating Faces: Case Study Using a Simplified Model
7.4.1 A Simplified Model of Face Generation
7.4.2 Discussion of Psychological and Other Consequences
7.5 Complexity of an Object: Hierarchical Information
7.6 Consequences Arising from This Chapter

8 Geometry and Topology of Matte Blanco’s Bi-Logic in Psychoanalytics

8.1 Approaching Data and the Object of Study, Mental Processes
8.1.1 Historical Role of Psychometrics and Mathematical Psychology
8.1.2 Summary of Chapter Content
8.1.3 Determining Depth of Emotion, and Tracking Emotion
8.2 Matte Blanco’s Psychoanalysis: A Selective Review
8.3 Real World, Metric Space: Context for Asymmetric Mental Processes
8.4 Ultrametric Topology, Background and Relevance in Psychoanalysis
8.4.1 Ultrametric
8.4.2 Inducing an Ultrametric through Agglomerative Hierarchical Clustering
8.4.3 Transitions from Metric to Ultrametric Representation, and Vice Versa, through Data Transformation
8.4.4 Practical Applications
8.5 Conclusion: Analytics of Human Mental Processes
8.6 Annex 1: Far Greater Computational Power of Unconscious Mental Processes
8.7 Annex 2: Text Analysis as a Proxy for Both Facets of Bi-Logic

9 Ultrametric Model of Mind: Application to Text Content Analysis

9.1 Introduction
9.2 Quantifying Ultrametricity
9.2.1 Ultrametricity Coefficient of Lerman
9.2.2 Ultrametricity Coefficient of Rammal, Toulouse and Virasoro
9.2.3 Ultrametricity Coefficients of Treves and of Hartman
9.2.4 Bayesian Network Modelling
9.2.5 Our Ultrametricity Coefficient
9.2.6 What the Ultrametricity Coefficient Reveals
9.3 Semantic Mapping: Interrelationships to Euclidean, Factor Space
9.3.1 Correspondence Analysis: Mapping χ² into Euclidean Distances
9.3.2 Input: Cloud of Points Endowed with the Chi-Squared Metric
9.3.3 Output: Cloud of Points Endowed with the Euclidean Metric in Factor Space
9.3.4 Conclusions on Correspondence Analysis and Introduction to the Numerical Experiments to Follow
9.4 Determining Ultrametricity through Text Unit Interrelationships
9.4.1 Brothers Grimm
9.4.2 Jane Austen
9.4.3 Air Accident Reports
9.4.4 DreamBank
9.5 Ultrametric Properties of Words
9.5.1 Objectives and Choice of Data
9.5.2 General Discussion of Ultrametricity of Words
9.5.3 Conclusions on the Word Analysis
9.6 Concluding Comments on this Chapter
9.7 Annex 1: Pseudo-Code for Assessing Ultrametric-Respecting Triplet
9.8 Annex 2: Bradley Ultrametricity Coefficient

10 Concluding Discussion on Software Environments

10.1 Introduction
10.2 Complementary Use with Apache Solr (and Lucene)
10.3 In Summary: Treating Massive Data Sets with Correspondence Analysis
10.3.1 Aggregating Similar or Identical Profiles Is Welcome
10.3.2 Resolution Level of the Analysis Carried Out
10.3.3 Random Projections in Order to Benefit from Data Piling in High Dimensions
10.3.4 Massive Observation Cardinality, Moderate Sized Dimensionality
10.4 Concluding Notes

Bibliography

Index

Preface

This is my motto: Analysis is nothing, data are everything. Today, on the web, we
can have baskets full of data . . . baskets or bins?
Jean-Paul Benzécri, 2011

This book describes solid and supportive foundations for the data science of our times,
with many illustrative cases. Core to these foundations are mathematics and computational
science. Our thinking and decision-making in regard to data can follow the insightful observation
by the physicist Paul Dirac that physical theory and physical meaning have to
follow behind the mathematics (see Section 4.7). The hierarchical nature of complex reality
is part and parcel of this mathematically well-founded way of observing and interacting
with physical, social and all realities.
Quite wide-ranging case studies are used in this book. The text, however, is written in an
accessible and easily grasped way, for a reader who is knowledgeable and engaged, without
necessarily being an expert in all matters. Ultimately this book seeks to inspire, motivate and
orientate our human thinking and acting regarding data, associated information and derived
knowledge. This book seeks to give the reader a good start towards practical and meaningful
perspectives. Also, by seeking to chart out future perspectives, this book responds to current
needs in a way that is unlike other books of some relevance to this field, excellent though
these may be in their own specialisms.
The field of data science has come into its own, with a high profile, in recent times.
Ever increasing numbers of employees are required nowadays, as data scientists, in sectors
that range from retail to regulatory, and so much besides. Many universities, having started
graduate-level courses in data science, are now also starting undergraduate courses. Data
science encompasses traditional disciplines of computational science and statistics, data
analysis, machine learning and pattern recognition. But new problem domains are arising.
Back in the 1970s and into the 1980s, one had to pay a lot of attention to available memory
storage when working with computers. The focus of attention was therefore on stored
data, directly linked to computational processing power. By the beginning of the 1990s,
communication and networking had become the focus of attention. Against the background
of regulatory and proprietary standards, and open source communication protocols (ISO
standards, DECnet, TCP/IP protocols, and so on), data access and display protocols became
central (File Transfer Protocol, Gopher, Veronica, Wide Area Information Server, and
Hypertext Transfer Protocol). So the focus back in those times was, firstly, on memory
and computer power and, secondly, on communications and networking. Now we have, thirdly,
data as the prime focus. Such waves of technology development are exciting. They motivate
the tackling of new problems, and they may well require new ways of addressing problems.
Such a requirement for new perspectives and new approaches is always due to contemporary
inadequacies, limitations and underperformance. Now we move on to our interaction with data.
This book targets rigour, mathematics and computational thinking. Through available
data sets and R code, reproducibility by the reader of results and outcomes is facilitated.
Indeed, understanding is also facilitated through “learning by doing”. The case studies and
the available data and software codes are intended to help impart the data science philosophy
in the book. In that sense, dialoguing with data, and “letting the data speak”
(Jean-Paul Benzécri), are the perspective and the objective. To the foregoing quotations,
the following will be added: “visualization and verbalization of data” (cf. [34]).
Our approach is influenced by how the leading social scientist, Pierre Bourdieu, used the
most effective inductive analytics developed by Jean-Paul Benzécri. This family of geometric
data analysis methodologies, centrally based on correspondence analysis encompassing
hierarchical clustering, and statistical modelling, not only organizes the analysis methodology
and domain of application but, most of all, integrates them. An inspirational set of
principles for data analytics, listed in [24] (page 6), included the following: “The model
should follow the data, and not the reverse. . . . What we need is a rigorous method that
extracts structures from data.” Closely coupled to this is the view that “data synthesis”
could be considered as equally important as, if not more important than, “data analysis” [27].
Analysis and synthesis of data and information obviously go hand in hand.
A very minor note is the following. Analytics refers to general and generic data processing,
obtaining information from data, while analysis refers to specific data processing.
We have then the following. “If I make extensive use of correspondence analysis, in
preference to multivariate regression, for instance, it is because correspondence analysis is a
relational technique of data analysis whose philosophy corresponds exactly to what, in my
view, the reality of the social world is. It is a technique which ‘thinks’ in terms of relation,
as I try to do precisely in terms of field” (Bourdieu, cited in [133, p. 43]).
“In Data Analysis, numerous disciplines need to collaborate. The role of mathematics,
although essential, is modest, in the sense that one uses almost exclusively classical the-
orems or elementary demonstration techniques. But it is necessary that certain abstract
conceptions enter into the spirits of the users, the specialists who collect the data and who
should orientate the analysis according to fundamental problems that are appropriate to
their science” [27].
No method is fruitful unless the data are relevant: “analysing data is not the collecting
of disparate data and seeing what comes out of the computer” [27]. In contradistinction
to statistics being “technical control” of process, certifying that work has been carried out
in conformance with rules, there with primacy accorded to being statistically correct, even
asking if such and such a procedure has the right to be used – in contradistinction to that,
there is relevance, asking if there is interest in using such and such a procedure.
Another inspirational quotation is that “the construction of clouds leads to the mastery
of multidimensionality, by providing ‘a tool to make patterns emerge from data’” (this is
from Benzécri’s 1968 Honolulu conference; the 1969 proceedings included his paper, “Statistical
analysis as a tool to make patterns emerge from data”). John Tukey (developer of
exploratory data analysis, i.e. visualization in statistics and data analysis, the fast Fourier
transform, and many other methods) expressed this as follows: “Let the data speak for
themselves!” This can be kept in mind relative to direct, immediate, unmediated statistical
hypothesis testing that relies on a wide range of assumptions (e.g. normality, homoscedas-
ticity, etc.) that are often unrealistic and unverifiable.
The foregoing and the following are in [130]. “Data analysis, or more particularly ge-
ometric data analysis is the multivariate statistical approach, developed by J.-P. Benzécri
around correspondence analysis, in which data are represented in the form of clouds of
points and the interpretation is first and foremost on the clouds of points.”
While these are our influences, it would be good, too, to note how new problem areas of
Big Data are of concern to us, and also issues of Big Data ethics. A possible ethical issue,
entirely due to technical aspects, lies in the massification and reduction through scale effects
that are brought about by Big Data. From [130]: “Rehabilitation of individuals. The context
model is always formulated at the individual level, being opposed therefore to modelling at
an aggregate level for which the individuals are only an ‘error term’ of the model.”
Now let us look at the importance of homology and field, concepts that are inherent
to Bourdieu’s work. The comprehensive survey of [108] sets out new contemporary issues
of sampling and population distribution estimation. An important take-home message is
this: “There is the potential for big data to evaluate or calibrate survey findings . . . to help
to validate cohort studies”. Examples are discussed of “how data . . . tracks well with the
official”, and contextual, repository or holdings. It is well pointed out how one case study
discussed “shows the value of using ‘big data’ to conduct research on surveys (as distinct
from survey research)”. Therefore, “The new paradigm means it is now possible to digitally
capture, semantically reconcile, aggregate, and correlate data.”
Limitations, though, are clear [108]: “Although randomization in some form is very
beneficial, it is by no means a panacea. Trial participants are commonly very different
from the external . . . pool, in part because of self-selection”. This is because “One type of
selection bias is self-selection (which is our focus)”.
Important points towards addressing these contemporary issues include the following
[108]: “When informing policy, inference to identified reference populations is key”. This is
part of the bridge which is needed between data analytics technology and deployment of
outcomes. “In all situations, modelling is needed to accommodate non-response, dropouts
and other forms of missing data.”
While “Representativity should be avoided”, here is an essential way to address in a
fundamental way what we need to address [108]: “Assessment of external validity, i.e. gen-
eralization to the population from which the study subjects originated or to other popula-
tions, will in principle proceed via formulation of abstract laws of nature similar to physical
laws”.
The bridge between the data that is analysed, and the calibrating Big Data, is well
addressed by the geometry and topology of data. Those form the link between sampled data
and the greater cosmos. Pierre Bourdieu’s concept of field is a prime exemplar. Consider, as
noted in [132], how Bourdieu’s work involves “putting his thinking in mathematical terms”,
and that it “led him to a conscious and systematic move toward a geometric frame-model”.
This is a multidimensional “structural vision”. Bourdieu’s analytics “amounted to the global
[hence Big Data] effects of a complex structure of interrelationships, which is not reducible
to the combination of the multiple [effects] of independent variables”. The concept of field,
here, uses geometric data analysis that is core to the integrated data and methodology
approach used in the correspondence analysis platform [177].
In addressing the “rehabilitation of individuals”, which can be considered as address-
ing representativity both quantitatively as well as qualitatively, there is the potential and
relevance for the many ethical issues related to Big Data, detailed in [199]. We may say
that in the detailed case study descriptions in that book, what is unethical is the arbitrary
representation of an individual by a class or group.
The term analytics platform for the science of data, which is quite central to this book,
can be associated with an interesting article by New York Times author Steve Lohr [146]
on the “platform thinking” of the founders of Microsoft, Intel and Apple. In this book
the analytics platform is paramount, over and above just analytical or software tools. In his
article [146] Lohr says: “In digital-age competition, the long goal is to establish an industry-
spanning platform rather than merely products. It is platforms that yield the lucrative
flywheel of network effects, complementary products and services and increasing returns.” In
this book we describe a data analytics platform. It is to have the potential to go way beyond
mere tools. It is to be accepted that software tools, incorporating the needed algorithms,
can come to one’s aid in the nick of time. That is good. But for a deep understanding of
all aspects of potential (i.e. having potential for further usage and benefit) and practice,
“platform” is the term used here for the following: potential importance and relevance, and
a really good conceptual understanding or role. The excellent data analyst does not just
come along with a software bag of tricks. The outstanding data analyst will always strive
for full integration of theory and practice, of methodology and its implementation.
An approach to drawing benefit from Big Data is precisely as described in [108]. The
observation of the need for the “formulation of abstract laws” that bridge sampled data
and calibrating Big Data can be addressed, for the data analyst and for the application
specialist, as geometric and topological.
In summary, then, this book’s key points include the following.

• Our analytics are based on letting the data speak.
• Data synthesis, as well as information and knowledge synthesis, is as important as data
analysis.
• In our analytics, an aim too is to rehabilitate the individual (see above).
• We have as a central focus the seeking of, and finding, homology in practice. This is
very relevant for Big Data calibration of our analytics.
• In high dimensions, all becomes hierarchical. This is because, as dimensionality tends
to infinity (and this is a nice perspective on unconscious thought processes), metric
becomes ultrametric.
• A major problem of our times may be addressed in both geometric and algebraic ways
(remembering Dirac’s quotation about the primacy of mathematics even over physics).
• We need to bring new understanding to bear on the dark energy and dark matter of
the cosmos that we inhabit, and of the human mind, and of other issues and matters
besides. These are among the open problems that haunt humanity.

One major motivation for some of this book’s content, related to the fifth item here, is
to see, and draw benefit from, the remarkable simplicity of very high dimensions, and even
infinite dimensionality. With reference to the last item here, there is a very nice statement
by Immanuel Kant, in Chapter 34 of Critique of Practical Reason (1788): “Two things fill
the mind with ever newer and increasing wonder and awe, the more often and lasting that
reflection is concerned with them: the starry sky over me, and the moral law within me.”
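The remark above on very high dimensions can be explored numerically. The following is a minimal sketch, not taken from the book (whose own code is in R): it assumes points drawn uniformly at random from the unit hypercube, and the function name `mean_isosceles_ratio` is hypothetical. In an ultrametric space the two largest sides of every triangle are equal, so the ratio of the middle side to the largest side is 1; the sketch estimates the average of that ratio for a given dimensionality.

```python
import math
import random


def mean_isosceles_ratio(dim, n_triples=200, seed=0):
    """Average ratio (middle side / largest side) over random triangles.

    Assumption: points are uniform in the unit hypercube of dimension `dim`.
    A value close to 1 means triangles are close to the isosceles or
    equilateral shapes that an ultrametric requires.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_triples):
        pts = [[rng.random() for _ in range(dim)] for _ in range(3)]
        sides = sorted(
            math.dist(p, q)
            for p, q in ((pts[0], pts[1]), (pts[0], pts[2]), (pts[1], pts[2]))
        )
        total += sides[1] / sides[2]
    return total / n_triples


if __name__ == "__main__":
    for dim in (2, 20, 200, 2000):
        print(dim, round(mean_isosceles_ratio(dim), 3))
```

Under these assumptions the ratio should move towards 1 as `dim` grows, which is one informal way to see distances behaving more and more like ultrametric distances.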

The Book’s Website


The website accompanying this book, which can be found at
http://www.DataScienceGeometryTopology.info
has data sets which are referred to and used in the text. It also has accessible R code which
has been used in the many and varied areas of work that are at issue in this book. In some
cases, too, there are graphs and figures from outputs obtained.
Provision of data and of some R software, and in a few cases, other software, is with
the following objective: to facilitate learning by doing, i.e. carrying out analyses, and re-
producing results and outcomes. That may be both interesting and useful, in parallel with
the more methodology-related aspects that can be, and that ought to be, revealing and
insightful.

Collaborators and Benefactors: Acknowledgements


Key collaborating partners are acknowledged when our joint work is cited throughout the
book.
A key stage in this overall work was PhD scholarship funding, with support from the
Smith Institute for Industrial Mathematics and System Engineering, and with company
support for that, from ThinkingSafe.
Further background was provided by the courses, based on all or very considerable parts of this
work, that were taught in April–May 2013 at the First International Conference on Models
of Complex Hierarchic Systems and Non-Archimedean Analysis, Cinvestav, Abacus Cen-
ter, Mexico; and in August 2015 at ESSCaSS 2015, the 14th Estonian Summer School on
Computer and Systems Science, Nelijärve Puhkekeskus, Estonia.
Among far-reaching applications of this work there has been a support framework for
creative writing that resulted in many books being published. Comparative and qualita-
tive data and information assessment can be well and truly integrated with actionable
decision-making. Section 2.7 contains a short overview of these outcomes with quite major
educational, publishing and related benefits. It is nice to note that this work was awarded
a prestigious teaching prize in 2010, at Royal Holloway University of London. Colleagues
Dr Joe Reddington and Dr Douglas Cowie and I, well linked to this book’s qualitative
and quantitative analytics platform, obtained this award with the title, “Project TooMany-
Cooks: applying software design principles to fiction writing”.
A number of current collaborations and partnerships, including with corporate and
government agencies, will perhaps deliver paradigm-shift advances.

Brief Introduction to Chapters


The chapters of this book are quite largely self-contained, meaning that in a summary way,
or sometimes with more detail, there can be essential material that is again presented in any
given chapter. This is done so as to take into account the diversity of application domains.

• Chapter 1 relates to the mapping of the semantics, i.e. the inherent meaning and sig-
nificance of information, underpinning and underlying what is expressed textually and
quantitatively. Examples include script story-line analysis, using film script, national
research funding, and performance management.
• Chapter 2 relates to a case study of change over time in Twitter. Quantification, includ-
ing even statistical analysis, of style is motivated by domain-originating stylistic and
artistic expertise and insight. Also covered is narrative synthesis and generation.
• Those two chapters comprise Part I, relating to film and movie, literature and docu-
mentation, some social media such as Twitter, and the recording, in both quantitative
and qualitative ways, of some teamwork activities.
• The accompanying website has as its aim to encourage and to facilitate learning and
understanding by doing, i.e. by actively undertaking experimentation and familiarization
with all that is described in this book.

• Next comes Part II, relating to underpinning methodology and vantage points.
Paramount are geometry for the mapping of semantics, and, based on this, tree or
hierarchical topology, for lots of objectives.
• Chapter 3 relates to how hierarchy can express symmetry. Also at issue is how such
symmetries in data and information can be so revealing and informative.
• Chapter 4 is a review chapter, relating to fundamental aspects that are intriguing, and
maybe with great potential, in particular for cosmology. This chapter relates to the
theme that analytics through real-valued mathematics can be very beneficially com-
plemented by p-adic and, relatedly, m-adic number theory. There is some discussion of
relevance and importance in physics and cosmology.
• Part III relates to outcomes from somewhat more computational perspectives.
• Chapter 5 explains the operation of, and the great benefits to be derived from, linear-
time hierarchical clustering. Lots of associations with other techniques and so on are
included.
• The focus in Chapter 6 is on new application domains such as very high-dimensional
data. The chapter describes what we term informally the remarkable simplicity of very
high-dimensional data, and, quite often, very big data sets and massive data volumes.
• Part IV seeks to describe new perspectives arising out of all of the analytics here, with
relevance for various application domains.
• Chapter 7 relates to novel definitions and usage of the concept of information.
• Then Chapter 8 relates to ultrametric topology expressing or symbolically representing
human unconscious reasoning. Inspiration for this most important and insightful work
comes from the eminent psychoanalyst Ignacio Matte Blanco’s pursuit of bi-logic, the
human’s two modes of being, conscious and unconscious.
• Chapter 9 takes such analytics further, with application to very varied expressions of
narrative, embracing literature, event and experience reporting.
• Chapter 10 discusses a little the broad and general application of methods at issue here.

Part I

Narratives from Film and Literature, from Social Media and Contemporary Life

1 The Correspondence Analysis Platform for Mapping Semantics

1.1 The Visualization and Verbalization of Data


All-important for the big picture to be presented is an introductory description of the geometry
of data, and of how we can proceed to both visualizing and interpreting data. We can
even claim to be verbalizing our data. To begin with, the embedding of our data in a metric
space is our very central interest in the geometry of data. This metric space provides a latent
semantic representation of our data. Semantics, or meaning, comes from the sum total of
the interrelations of our observations or objects, and of their attributes or properties. Our
particular focus is on mathematical description of our data analysis platform (or framework).
We then move from the geometry of metric spaces to the hierarchical topology that
allows our data to be structured into clusters.
We address both the mathematical framework and underpinnings, and also algorithms.
Hand in hand with the algorithms goes implementation in R (typically).
Contemporary information access is very often ad hoc. Querying a search engine ad-
dresses some user needs, with content that is here, there and anywhere. Information re-
trieval results in bits and pieces of information that are provided to the user. On the other
hand, information synthesis can refer to the fact that multiple documents and information
sources will provide the final and definitive user information. This challenge of Big Data is
now looming (J. Mothe, personal communication): “Big Data refers to the fact that data
or information is voluminous, varied, and has velocity but above all that it can lead to
value provided that its veracity has been properly checked. It implies new information sys-
tem architecture, new models to represent and analyse heterogeneous information but also
new ways of presenting information to the user and of evaluating model effectiveness. Big
Data is specifically useful for competitive intelligence activities.” It is this outcome that is
a good challenge, that is to be addressed through the geometry and topology of data and
information: “aggregating information from heterogeneous resources is unsolved.”
We can and we will anticipate various ways to address these interesting new challenges.
Jean-Paul Benzécri, who was ahead of his time in so many ways, indicated (including in
[27]) that “data synthesis” could be considered as equally if not more important relative to
“data analysis”. Analysis and synthesis of data and information obviously go hand in hand.
Data analytics are just one side of what we are dealing with in this book. The other side,
we could say, is that of inductive data analysis. In the context or framework of practical
data-related and data-based activity, the processes of data synthesis and inductive data
analysis are what we term a narrative. In that sense, we claim to be tracing and tracking
the lives of narratives. That is, in physical and behavioural activities, and of course in
mental and thought processes.

1.2 Analysis of Narrative from Film and Drama


1.2.1 Introduction
We study two aspects of information semantics: (i) the collection of all relationships; (ii)
tracking and spotting anomaly and change. The first is implemented by endowing all relevant
information spaces with a Euclidean metric in a common projected space. The second is
modelled by an induced ultrametric. A very general way to achieve a Euclidean embedding
of different information spaces based on cross-tabulation counts (and from other input data
formats) is provided by correspondence analysis. From there, the induced ultrametric that
we are particularly interested in takes a sequential (e.g. temporal) ordering of the data
into account. We employ such a perspective to look at narrative, “the flow of thought and
the flow of language” [45]. In application to policy decision-making, we show how we can
focus analysis in a small number of dimensions.
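To make the Euclidean embedding of cross-tabulation counts more concrete, here is a minimal sketch in Python (the book's own code is in R). It computes the chi-squared distance between two row profiles of a contingency table, which is the distance that correspondence analysis represents as a Euclidean distance in the factor space. The small table and the function name `chi2_distance_rows` are illustrative assumptions, not data or code from the book.

```python
import math


def chi2_distance_rows(table, i, k):
    """Chi-squared distance between the profiles of rows i and k.

    Each row is divided by its own total to give a profile; squared
    differences between profiles are weighted by the inverse of the
    column masses (column totals divided by the grand total).
    """
    grand_total = sum(sum(row) for row in table)
    n_cols = len(table[0])
    col_mass = [sum(row[j] for row in table) / grand_total for j in range(n_cols)]
    profile_i = [v / sum(table[i]) for v in table[i]]
    profile_k = [v / sum(table[k]) for v in table[k]]
    return math.sqrt(
        sum((a - b) ** 2 / c for a, b, c in zip(profile_i, profile_k, col_mass))
    )


# Hypothetical 3 x 3 table of counts (rows could be scenes, columns words).
toy_table = [
    [2, 1, 0],
    [4, 2, 0],
    [0, 1, 3],
]

if __name__ == "__main__":
    print(chi2_distance_rows(toy_table, 0, 1))  # proportional rows
    print(chi2_distance_rows(toy_table, 0, 2))
```

Rows 0 and 1 of the toy table are proportional, so they have identical profiles and zero distance: the distance compares relative composition, not raw totals.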
The data mining and data analysis challenges addressed are the following.

• Great masses of data, textual and otherwise, need to be exploited and decisions need
to be made. Correspondence analysis handles multivariate numerical and symbolic data
with ease.
• Structures and interrelationships evolve in time.
• We must consider a complex web of relationships.
• We need to address all these issues from data sets and data flows.

Various aspects of how we respond to these challenges will be discussed in this chapter,
complemented by the annex to the chapter. We will look at how this works, using the
Casablanca film script. Then we return to the data mining approach used, to propose that
various issues in policy analysis can be addressed by such techniques also.

1.2.2 The Changing Nature of Movie and Drama


McKee [153] bears out the great importance of the film script: “50% of what we understand
comes from watching it being said.” And: “A screenplay waits for the camera. . . . Ninety
percent of all verbal expression has no filmic equivalent.”
An episode of a television series costs [177] $2–3 million per hour of television, or
£600,000–800,000 for a similar series in the UK. Generally screenplays are written spec-
ulatively or commissioned, and then prototyped by the full production of a pilot episode.
Increasingly, and especially availed of by the young, television series are delivered via the
Internet.
Originating in one medium – cinema, television, game, online – film and drama series are
increasingly migrated to another. So scriptwriting must take account of digital multimedia
platforms. This has been referred to in computer networking parlance as “multiplay” and
in the television media sector as a “360 degree” environment.
Cross-platform delivery motivates interactivity in drama. So-called reality TV has a
considerable degree of interactivity, as well as being largely unscripted.
There is a burgeoning need for us to be in a position to model the semantics of film script:
its most revealing structures, patterns and layers. With the drive towards interactivity, we
also want to leverage this work towards more general scenario analysis. Potential applica-
tions are to business strategy and planning; education and training; and science, technology
and economic development policy. We will discuss initial work on the application to policy
decision-making in Section 1.3 below.

1.2.3 Correspondence Analysis as a Semantic Analysis Platform


For McKee [153], film script text is the “sensory surface of a work of art” and reflects
the underlying emotion or perception. Our data mining approach models and tracks these
underlying aspects in the data. Our approach to textual data mining has a range of novel
elements.
Firstly, a novelty is our focus on the orientation of narrative through correspondence
analysis [24, 171] which maps scenes (and sub-scenes) and words used, in a largely automated
way, into a Euclidean space representing all pairwise interrelationships. Such a space is
ideal for visualization. Interrelationships between scenes are captured and displayed, as well
as interrelationships between words, and mutually between scenes and words. In a given
context, comprehensive and exhaustive data, with consequent understanding and use of
one’s actionable data, are well and truly integrated in this way.
The starting point for analysis is frequency of occurrence data, typically the ordered
scenes crossed by all words used in the script.
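As an illustration of this starting point, the following sketch builds a scenes × words table of frequencies of occurrence. It is written in Python rather than the book's R, the function name `scene_word_table` is hypothetical, the tokenization (lower-casing and splitting on whitespace) is a simplifying assumption, and the two short strings are invented placeholders rather than text from any script.

```python
from collections import Counter


def scene_word_table(scenes):
    """Cross-tabulate ordered scenes (rows) by all words used (columns).

    Returns the sorted vocabulary and a list of rows of counts, one row
    per scene, in the order in which the scenes were given.
    """
    counts = [Counter(text.lower().split()) for text in scenes]
    vocabulary = sorted(set().union(*counts))
    table = [[c[word] for word in vocabulary] for c in counts]
    return vocabulary, table


if __name__ == "__main__":
    # Invented placeholder strings, not script text.
    toy_scenes = ["rick enters cafe", "ilsa enters cafe cafe"]
    vocabulary, table = scene_word_table(toy_scenes)
    print(vocabulary)
    for row in table:
        print(row)
```

Keeping the rows in scene order matters here, because the later analysis of change relies on the sequence of scenes.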
If the totality of interrelationships is one facet of semantics, then another is anomaly or
change as modelled by a clustering hierarchy. If, therefore, a scene is quite different from
immediately previous scenes, then it will be incorporated into the hierarchy at a high level.
This novel view of hierarchy will be discussed further in Section 1.2.5 below.
We draw on these two vantage points on semantics – viz. totality of interrelationships,
and using a hierarchy to express change.
Among further work that is covered in Section 1.2.9 and further in Section 2.5 of Chapter
2 is the following. We can design a Monte Carlo approach to test statistical significance
of the given script’s patterns and structures as opposed to randomized alternatives (i.e.
randomized realizations of the scenes). Alternatively, we examine caesuras and breakpoints
in the film script, by taking the Euclidean embedding further and inducing an ultrametric
on the sequence of scenes.
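A randomization test of this general kind can be sketched as follows. This is only an illustration under stated assumptions, in Python rather than the book's R: the test statistic (total Euclidean distance between successive scenes) and the names `path_length` and `randomization_p_value` are hypothetical choices, not the statistics used in the book. Scenes are assumed to be already embedded as coordinate tuples.

```python
import math
import random


def path_length(points, order):
    """Sum of Euclidean distances between successive scenes in `order`."""
    return sum(math.dist(points[a], points[b]) for a, b in zip(order, order[1:]))


def randomization_p_value(points, n_random=999, seed=0):
    """Share of randomized scene orders whose path is at most as long as
    the observed one (with the usual +1 correction).

    A small value suggests that successive scenes are closer together
    than would be expected if the scene order were arbitrary.
    """
    rng = random.Random(seed)
    observed_order = list(range(len(points)))
    observed = path_length(points, observed_order)
    at_most = 0
    for _ in range(n_random):
        shuffled = observed_order[:]
        rng.shuffle(shuffled)
        if path_length(points, shuffled) <= observed:
            at_most += 1
    return (1 + at_most) / (1 + n_random)


if __name__ == "__main__":
    # Toy embedding: ten "scenes" drifting steadily along one axis.
    toy_points = [(float(i), 0.0) for i in range(10)]
    print(randomization_p_value(toy_points))
```

Any other statistic computed from the ordered scenes could be substituted for `path_length`; the randomization loop stays the same.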

1.2.4 Casablanca Narrative: Illustrative Analysis


The well-known movie Casablanca serves as an example for us. Film scripts, such as for
Casablanca, are partially structured texts. Each scene has metadata, and the body of the
scene contains dialogue and possibly other descriptive data. The Casablanca script was half
completed when production began in 1942. The dialogue for some scenes was written while
shooting was in progress. Casablanca was based on an unpublished 1940 screenplay [43]. It
was scripted by J.J. Epstein, P.G. Epstein and H. Koch. The film was directed by M. Curtiz
and produced by H.B. Wallis and J.L. Warner. It was shot by Warner Bros. between May
and August 1942.
As an illustrative first example we use the following. A data set was constructed from
the 77 successive scenes crossed by attributes: Int[erior], Ext[erior], Day, Night, Rick, Ilsa,
Renault, Strasser, Laszlo, Other (i.e. minor character), and 29 locations. Many locations
were met with just once; and Rick’s Café was the location of 36 scenes. In scenes based in
Rick’s Café we did not distinguish between “Main room”, “Office”, “Balcony”, etc. Because
of the plethora of scenes other than Rick’s Café we assimilate these to just one, “other than
Rick’s Café”, scene.
In Figure 1.1, 12 attributes are displayed. If useful, the 77 scenes can be displayed as
dots (to avoid overcrowding of labels). Approximately 34% (for factor 1) + 15% (for factor
2) = 49% of all information, expressed as inertia explained, is displayed here. We can study
interrelationships between characters, other attributes, and scenes, for instance closeness of
Rick’s Café with Night and Int (obviously enough).

FIGURE 1.1: Correspondence analysis of the Casablanca data derived from the script.
The input data are presences/absences for 77 scenes crossed by 12 attributes. Just the 12
attributes are displayed. For a short review of the analysis methodology, see the annex to
this chapter.
[Plot not reproduced. Horizontal axis: Factor 1, 34% of inertia. Vertical axis: Factor 2,
15% of inertia. Labelled points: Strasser, Ilsa, Renault, NotRicks, Other, Laszlo, Rick, Int,
Day, Night, RicksCafe, Ext; scenes are shown as dots.]


1.2.5 Modelling Semantics via the Geometry and Topology of Information

Some underlying principles are as follows. We start with the cross-tabulation data, scenes
× attributes. Scenes and attributes are embedded in a metric space. This is how we are
probing the geometry of information, which is a term and viewpoint used by [236].
Underpinning the display in Figure 1.1 is a Euclidean embedding. The triangle inequality
holds for metrics. An example of a metric is the Euclidean distance, exemplified in Figure
1.2(a), where each and every triplet of points satisfies the relationship d(x, z) ≤ d(x, y) +
d(y, z) for distance d. Two other relationships also must hold: symmetry (d(x, y) = d(y, x))
and positive definiteness (d(x, y) > 0 if x ≠ y, d(x, y) = 0 if x = y).
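The three metric properties can be checked mechanically on a small set of points. The sketch below is in Python (the book's own code is in R); the function name `is_metric` and the sample points are illustrative assumptions. It tests symmetry, positive definiteness and the triangle inequality for every pair and triplet.

```python
import itertools
import math


def is_metric(points, d, tol=1e-12):
    """Return True if d satisfies the metric properties on `points`.

    Checks symmetry and positive definiteness on all pairs, then the
    triangle inequality d(x, z) <= d(x, y) + d(y, z) on all triplets.
    """
    for x, y in itertools.product(points, repeat=2):
        if abs(d(x, y) - d(y, x)) > tol:
            return False
        if x == y and d(x, y) > tol:
            return False
        if x != y and d(x, y) <= 0:
            return False
    for x, y, z in itertools.product(points, repeat=3):
        if d(x, z) > d(x, y) + d(y, z) + tol:
            return False
    return True


# Hypothetical points in the plane.
sample_points = [(0.0, 0.0), (3.0, 0.0), (3.0, 4.0), (-1.0, 2.5)]

if __name__ == "__main__":
    print(is_metric(sample_points, math.dist))
    print(is_metric(sample_points, lambda x, y: math.dist(x, y) ** 2))
```

The second call uses the squared Euclidean distance, which is symmetric and positive definite but can violate the triangle inequality, so it is a dissimilarity and not a metric.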
Further underlying principles used in Figure 1.1 are as follows. The axes are the principal
axes of inertia. Principles identical to those in classical mechanics are used. The scenes are
located as weighted averages of all associated attributes, and vice versa.
Huyghens’ theorem (see Figure 1.2(b)) relates to decomposition of inertia of a cloud of
points. This is the basis of correspondence analysis.
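The decomposition of inertia can be verified numerically. In its standard form, the theorem states that the inertia of a weighted cloud about any point a equals its inertia about the centre of gravity g plus the total mass times the squared distance between g and a. The following Python sketch (the book's code is in R) checks this on a hypothetical three-point cloud; the function names and the data are assumptions made for illustration.

```python
def centre_of_gravity(points, masses):
    """Mass-weighted mean of the points."""
    total = sum(masses)
    dim = len(points[0])
    return tuple(
        sum(m * p[k] for p, m in zip(points, masses)) / total for k in range(dim)
    )


def inertia(points, masses, about):
    """Sum over points of mass times squared Euclidean distance to `about`."""
    return sum(
        m * sum((pk - ak) ** 2 for pk, ak in zip(p, about))
        for p, m in zip(points, masses)
    )


# Hypothetical weighted cloud and reference point.
cloud = [(0.0, 0.0), (2.0, 0.0), (0.0, 4.0)]
masses = [0.5, 0.25, 0.25]
a = (3.0, -1.0)

g = centre_of_gravity(cloud, masses)
lhs = inertia(cloud, masses, a)
rhs = inertia(cloud, masses, g) + sum(masses) * sum(
    (gk - ak) ** 2 for gk, ak in zip(g, a)
)

if __name__ == "__main__":
    print(g, lhs, rhs)
```

One consequence is that the centre of gravity is the point about which the inertia of the cloud is smallest, which is why the cloud is analysed relative to it.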
We come now to a different principle: that of the topology of information. The particular
topology used is that of hierarchy. Euclidean embedding provides a very good starting point
to look at hierarchical relationships. One particular innovation in this work is as follows:
the hierarchy takes sequence (e.g. timeline) into account. This captures, in a more easily
understood way, the notions of novelty, anomaly or change.
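One way to make a hierarchy respect sequence is to allow only adjacent segments of the sequence to merge. The sketch below, in Python rather than the book's R, does this with a complete-link agglomeration criterion. The choice of criterion, the function name `sequence_constrained_clustering` and the toy data are assumptions for illustration; the excerpt does not specify the exact algorithm used.

```python
import math


def sequence_constrained_clustering(points):
    """Agglomerate contiguous segments of a sequence of points.

    At each step the two adjacent segments with the smallest complete-link
    distance (largest pairwise distance between their members) are merged.
    Returns the merges as (left_segment, right_segment, level) tuples.
    """
    segments = [[i] for i in range(len(points))]
    merges = []
    while len(segments) > 1:
        best = None
        for s in range(len(segments) - 1):
            level = max(
                math.dist(points[a], points[b])
                for a in segments[s]
                for b in segments[s + 1]
            )
            if best is None or level < best[0]:
                best = (level, s)
        level, s = best
        merges.append((tuple(segments[s]), tuple(segments[s + 1]), level))
        segments[s:s + 2] = [segments[s] + segments[s + 1]]
    return merges


if __name__ == "__main__":
    # Toy sequence: three similar "scenes", then an abrupt change.
    toy_sequence = [(0.0,), (1.0,), (2.0,), (50.0,), (51.0,)]
    for merge in sequence_constrained_clustering(toy_sequence):
        print(merge)
```

In the toy sequence the abrupt change after the third point only joins the earlier points at the final, highest level of the hierarchy, which is how a hierarchy of this kind can signal a break in a sequence.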
Life-assurance and Annuities, Post-office, 257
Lifeboat Competition, 459
Life-offices, How they pay their Death-claims, 97
Lighting Collieries by Electricity, 496
‘Limited Liability,’ Joint-stock Companies and, 577
Literary Beginners, Another Word to, 49
Literary Manuscripts, Episodes of, 283
—— Self-estimates, 220
Locusts, Cyprus, 801
London Bonded Warehouses, 58
—— Hospitals and Dispensaries, 518
London, Nature around, 225
——, Remains of Ancient, 654
——, Sanitary Inspection of the Port of, 534
Love, Concerning, 156, 333
Maiden Speeches, Parliamentary, 150
Man and Nature, 608
Manufactures, Noxious, 239
Marine Station, a Scottish, 465
Marriage, the Net of, 432
Marsala, a Sample of, 795
Mediæval Room at the British Museum, New, 693
Medicine, Common Errors in Domestic, 299
Microphone, Curiosities of the, 373
Migration, Bird, 481
Miner’s Partner, the, 138, 152, 168, 185
Miss Marrable’s Elopement, 188, 198
Missing Clue, the, 701, 702, 716, 718, 732, 749, 751, 764
Monastic England, 4
Money-borrowing, the Shady Side of, 166
Month, The: Science and Arts—59, 124, 201, 265, 348, 412, 476,
557, 620, 685, 761, 825
Moor and Loch, On, 433
Morality, Stock Exchange, 828
Mortality, Some Cheering Aspects of, 449
Moulmein Elephants, the, 638
Mr Pudster’s Return, 569, 584
Mrs Shaw, the Late Prince Imperial’s Nurse, 32
Mushrooms for the Million, 501
Musk-rat of India, the, 703
Name? What’s in a, 813
Nameless Romance, a, 541
Nature around London, 225
—— on the Roof, 385
Nettle-cloth, 145
New Zealand, Explorations in, 637
Newsmonger, the, 353
Newspapers, Curious, 591
—— on Themselves, American, 714
Norfolk Broads and Rivers, 273
Norman Seascape, a, 390
Notes on Persian Art, 808
Noxious Manufactures, 239

Occasional Notes—
Abnormal Humanity, 304
Advice to Intending Emigrants, 479
Albo-carbon Light, New, 830
Ambulance Societies, 63
American Literary Piracy, 271
Anthropometrical Laboratory at the Health Exhibition, 479
Bacchus, Discovery of Statue of, 656
Blindness in Infancy, Prevention of, 206
Burns and Scalds, 655
Canine ‘Collector,’ 415
Card-telegrams, 206
Casualties on the British Coast, 624
Chilian Argentine Andes, Exploration in the, 831
Coffee? Why do we now drink less, 303
Curious Disease, 480
Diarrhœa and Cholera, Treatment of, 560
Dissection after Death, 205
Dutch Rush, 351
Earthquake in England, Recent, 351
Electric Light in Railway Carriages, 205
Electric Lighting for Ships, Improved, 351
Electrical Tricycle, 320
Electricity as a Brake, 767
Ensilage, 829
Fastest Passage on Record, 415
French Crown Jewels, 559
Fruit-farm, a Flourishing, 207
Gas Cooking-stove, a Handy, 830
Grape and Peach in America, 207
Harbour of Refuge for East Coast of Scotland, 415
Herring Spawns, How and Where the, 206
Hydrophobia—Important Experiments, 205
Irish Female Emigration, 830
Labour and Wages in Australia, 204
Level-crossing Gates, 736
Lightning-strokes in France, 352
—— ——, Mechanical Characteristics of, 829
Lights and Lighthouses, Investigations on, 736
Marvellous Sunsets, Recent, 64
Metallic Compound, New, 415
Mummies, Making of, 767
Native Treatment of Diseases in India, 831
Novel Peal of Bells, a, 766
Oil Breakwater at Folkestone, 127
Old Westminster Houses, Last of the, 64
Old-fashioned Furniture, 320
Organ in Westminster Abbey, New, 479
Persons Killed by Wild Animals in India, 829
Postal Orders, New, 414
Railway Passengers, 830
Relics from the Holy Land, 768
Rome, Interesting Discovery at, 656
Russian Crown Estates, 204
—— Longevity, 272
Sion College, Last of Old, 830
Sowing and Harvesting, 272
Steam-ferry on the Thames, 767
Subterranean Fish, 415
Telegraph Extension, 127
Telegraphing Extraordinary, 624
Telephoning Extraordinary, 656
Trout-life, Interesting Notes on, 204
Turning Wood into Metal, 768
Uphill Railway, Another, 480
Utilisation of Sewage, 767

Old Provincial Fairs, 598
One Woman’s History, 631, 646, 664, 680, 696, 710, 728, 743,
757, 771, 786, 804, 819
Order of Mercy, an, 15, 63
Orkney Folk-lore—Legend of the Dwarfie Stone, 667
Outward and Homeward Bound, 301
Over-educating Children, 366
Paper, More Uses of, 742
Parasitic Worms—Queer Lodgers, 278
Parliamentary Maiden Speeches, 150
Parody, the Muse of, 71
Peer? What is a, 17
Peerage, Some Curiosities of the, 305, 326
Pencil-making, 582
Pens, Erratic, 313
People, an Ancient, 430
Persian Art, a Few Notes on, 808
—— Sherbet, Royal, 438
Peterborough, the ‘Strong-room’ at, 655
Pigeon, the Homing, 245
Pioneer, an Educational, 699
Pisciculture, the Progress of, 268
Poisoning, 769
Polecat, a Few Words about the, 190
Port of London, Sanitary Inspection of the, 534
Post-office Life-assurance and Annuities, 257
Prince Imperial’s Nurse, the Late; Mrs Shaw, 32
Printers’ Errors, 636
Prolonging Life, 427
Quarantine, 543
Queen Margaret College, 555
—— Margerie, 677
Queer Company, In, 444, 461
—— Dishes, Some, 230
—— Lodgers—Parasitic Worms, 278
Ranching, Some Realities of, 653
Rationale of Haunted Houses, the, 397
Recollections of an Anglo-Indian Chaplain, 792
Regiment, a Skating, 255
Remains of Ancient London, 654
Ring-trick, the, 735
River Holiday, a, 545
Rock-hewn Edicts, Ancient, 486
Roof, Nature on the, 385
Ruin, Sudden, 571
Run for Life, a, 488
Sacred Trees, Some, 509
Sample of Marsala, 795
Sanitary Inspection of the Port of London, 534
Schools, Army, 494
Science and Art School—Gordon’s College, Aberdeen, 183
Scot Abroad, Glimpses of the, 76
Scottish Deer-forests, 721
—— Marine Station, a, 465
Seals and Seal-hunting in Shetland, 364, 507
Seascape, a Norman, 390
Seashore Free to All? Is the, 539
Sensitive Plant, 159
Sherbet, Royal Persian, 438
Silas Monk: a Tale of London Old City, 361, 374, 394, 407
Skating Regiment, a, 255
Sketch from my Study Window, 343
Sledge-dogs, 9
Smoking Injurious to Health? Is, 78
Snakes: Do they ever commit Suicide? 672
Snakes, Indian, 214
‘So Unreasonable of Step-mother!’ 45
Solitary Island, a, 719
Spider-silk, 524
St John’s Gate, 527
St Marguerite and St Honorât, 369
St Peter’s, 91
Stained Glass as an Accessory to Domestic Architecture, 359
Statues, Ancient and Modern, the Largest, 470
Steel, 575
Steno-telegraph, the, 607
Stock Exchange Morality, 828
Story of a Vast Explosion, 705
Strange Institution, 96
‘Strong-room’ at Peterborough, the, 655
Studies in Animal Life, 822
Suakim, 196
Sudden Fortunes, 241
—— Ruin, 571
Suicide, 292
Superstitions, the Common-sense of, 237
Surgical Scraps, 383
Terribly Fulfilled, 424, 440, 454
Thieves and Thieving, 525
Trading, Some Instances of Eastern, 463
Transvaal Gold-fields, 177
Trees, Christmas, 748
——, Some Sacred, 509
Trimming the Feet of Elephants, 240
Troubadours, the, 171
Two Days in a Lifetime, 5, 25, 41, 54, 73, 87, 105, 119
Umpires at Cricket, 399
Vaccination, 593, 688
Væ Victis, 629
Vermudyn’s Fate, 536, 552
Washing by Steam, 800
Water, 497
Water-ousel, a Plea for the, 270
Weather, How it is Made and Forecast, 689
Whale, Commercial Products of the, 566
What’s in a Name? 813
White-lead Manufacture, a New Process of, 158
Windermere, the Charr of, 406
Wind-power, Abandonment of, 319
Witness for the Defence, a, 217, 231, 247
Woman’s Work, a Word on, 606
Yarn of the P. and O., a, 503
Zulu Romance, a, 330
*** END OF THE PROJECT GUTENBERG EBOOK CHAMBERS'S
JOURNAL OF POPULAR LITERATURE, SCIENCE, AND ART,
INDEX FOR 1884 ***

Updated editions will replace the previous one—the old editions will
be renamed.

Creating the works from print editions not protected by U.S.
copyright law means that no one owns a United States copyright in
these works, so the Foundation (and you!) can copy and distribute it
in the United States without permission and without paying copyright
royalties. Special rules, set forth in the General Terms of Use part of
this license, apply to copying and distributing Project Gutenberg™
electronic works to protect the PROJECT GUTENBERG™ concept
and trademark. Project Gutenberg is a registered trademark, and
may not be used if you charge for an eBook, except by following the
terms of the trademark license, including paying royalties for use of
the Project Gutenberg trademark. If you do not charge anything for
copies of this eBook, complying with the trademark license is very
easy. You may use this eBook for nearly any purpose such as
creation of derivative works, reports, performances and research.
Project Gutenberg eBooks may be modified and printed and given
away—you may do practically ANYTHING in the United States with
eBooks not protected by U.S. copyright law. Redistribution is subject
to the trademark license, especially commercial redistribution.

START: FULL LICENSE


THE FULL PROJECT GUTENBERG LICENSE
PLEASE READ THIS BEFORE YOU DISTRIBUTE OR USE THIS WORK

To protect the Project Gutenberg™ mission of promoting the free
distribution of electronic works, by using or distributing this work (or
any other work associated in any way with the phrase “Project
Gutenberg”), you agree to comply with all the terms of the Full
Project Gutenberg™ License available with this file or online at
www.gutenberg.org/license.

Section 1. General Terms of Use and Redistributing Project Gutenberg™ electronic works

1.A. By reading or using any part of this Project Gutenberg™
electronic work, you indicate that you have read, understand, agree
to and accept all the terms of this license and intellectual property
(trademark/copyright) agreement. If you do not agree to abide by all
the terms of this agreement, you must cease using and return or
destroy all copies of Project Gutenberg™ electronic works in your
possession. If you paid a fee for obtaining a copy of or access to a
Project Gutenberg™ electronic work and you do not agree to be
bound by the terms of this agreement, you may obtain a refund from
the person or entity to whom you paid the fee as set forth in
paragraph 1.E.8.

1.B. “Project Gutenberg” is a registered trademark. It may only be
used on or associated in any way with an electronic work by people
who agree to be bound by the terms of this agreement. There are a
few things that you can do with most Project Gutenberg™ electronic
works even without complying with the full terms of this agreement.
See paragraph 1.C below. There are a lot of things you can do with
Project Gutenberg™ electronic works if you follow the terms of this
agreement and help preserve free future access to Project
Gutenberg™ electronic works. See paragraph 1.E below.
