Architecting a Modern
Data Warehouse for
Large Enterprises
Build Multi-cloud Modern Distributed Data
Warehouses with Azure and AWS

Anjani Kumar
Abhishek Mishra
Sanjeev Kumar
Architecting a Modern Data Warehouse for Large Enterprises: Build Multi-cloud
Modern Distributed Data Warehouses with Azure and AWS
Anjani Kumar, Gurgaon, India
Abhishek Mishra, Thane West, Maharashtra, India
Sanjeev Kumar, Gurgaon, Haryana, India

ISBN-13 (pbk): 979-8-8688-0028-3
ISBN-13 (electronic): 979-8-8688-0029-0
https://doi.org/10.1007/979-8-8688-0029-0

Copyright © 2024 by Anjani Kumar, Abhishek Mishra, and Sanjeev Kumar


This work is subject to copyright. All rights are reserved by the publisher, whether the whole or part of the
material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation,
broadcasting, reproduction on microfilms or in any other physical way, and transmission or information
storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now
known or hereafter developed.
Trademarked names, logos, and images may appear in this book. Rather than use a trademark symbol with
every occurrence of a trademarked name, logo, or image we use the names, logos, and images only in an
editorial fashion and to the benefit of the trademark owner, with no intention of infringement of the
trademark.
The use in this publication of trade names, trademarks, service marks, and similar terms, even if they are not
identified as such, is not to be taken as an expression of opinion as to whether or not they are subject to
proprietary rights.
While the advice and information in this book are believed to be true and accurate at the date of publication,
neither the authors nor the editors nor the publisher can accept any legal responsibility for any errors or
omissions that may be made. The publisher makes no warranty, express or implied, with respect to the
material contained herein.
Managing Director, Apress Media LLC: Welmoed Spahr
Acquisitions Editor: Smriti Srivastava
Development Editor: Laura Berendson
Coordinating Editor: Shaul Elson
Copy Editor: April Rondeau
Cover designed by eStudioCalamar
Cover image by Sven [email protected]
Distributed to the book trade worldwide by Apress Media, LLC, 1 New York Plaza, New York, NY 10004,
U.S.A. Phone 1-800-SPRINGER, fax (201) 348-4505, email [email protected], or visit
www.springeronline.com. Apress Media, LLC is a California LLC and the sole member (owner) is Springer Science
+ Business Media Finance Inc (SSBM Finance Inc). SSBM Finance Inc is a Delaware corporation.
For information on translations, please e-mail [email protected]; for reprint,
paperback, or audio rights, please e-mail [email protected].
Apress titles may be purchased in bulk for academic, corporate, or promotional use. eBook versions and
licenses are also available for most titles. For more information, reference our Print and eBook Bulk Sales
web page at http://www.apress.com/bulk-sales.
Any source code or other supplementary material referenced by the author in this book is available to
readers on GitHub (https://github.com/Apress). For more detailed information, please visit
https://www.apress.com/gp/services/source-code.
Paper in this product is recyclable
I dedicate this book to my mother, Prabhawati; my aunt,
Sunita; and my wife, Suchi.
— Anjani Kumar

I dedicate this book to my lovely daughter, Aaria.


— Abhishek Mishra

I dedicate this book to my late father, Shri Mahinder Nath.


— Sanjeev Kumar
Table of Contents
About the Authors
About the Technical Reviewer
Acknowledgments

Chapter 1: Introduction
Objective
Origin of Data Processing and Storage in the Computer Era
Evolution of Databases and Codd Rules
Transitioning to the World of Data Warehouses
Data Warehouse Concepts
Data Sources (Data Format and Common Sources)
ETL (Extract, Transform, Load)
Data Mart
Data Modeling
Cubes and Reporting
OLAP
Metadata
Data Storage Techniques and Options
Evolution of Big Data Technologies and Data Lakes
Transition to the Modern Data Warehouse
Traditional Big Data Technologies
The Emergence of Data Lakes
Data Lake House and Data Mesh
Transformation and Optimization between New vs. Old (Evolution to Data Lake House)
A Wider Evolving Concept Called Data Mesh
Building an Effective Data Engineering Team
An Enterprise Scenario for Data Warehousing
Summary

Chapter 2: Modern Data Warehouses
Objectives
Introduction to Characteristics of Modern Data Warehouse
Data Velocity
Data Variety
Volume
Data Value
Fault Tolerance
Scalability
Interoperability
Reliability
Modern Data Warehouse Features: Distributed Processing, Storage, Streaming, and Processing Data in the Cloud
Distributed Processing
Storage
Streaming and Processing
Autonomous Administration Capabilities
Multi-tenancy and Security
Performance
What Are NoSQL Databases?
Key–Value Pair Stores
Document Databases
Columnar DBs
Graph Databases
Case Study: Enterprise Scenario for Modern Cloud-based Data Warehouse
Advantages of Modern Data Warehouse over Traditional Data Warehouse
Summary

Chapter 3: Data Lake, Lake House, and Delta Lake
Structure
Objectives
Data Lake, Lake House, and Delta Lake Concepts
Data Lake, Storage, and Data Processing Engines Synergies and Dependencies
Implement Lake House in Azure
Create a Data Lake on Azure and Ingest the Health Data CSV File
Create an Azure Synapse Pipeline to Convert the CSV File to a Parquet File
Attach the Parquet File to the Lake Database
Implement Lake House in AWS
Create an S3 Bucket to Keep the Raw Data
Create an AWS Glue Job to Convert the Raw Data into a Delta Table
Query the Delta Table using the AWS Glue Job
Summary

Chapter 4: Data Mesh
Structure
Objectives
The Modern Data Problem and Data Mesh
Data Mesh Principles
Domain-driven Ownership
Data-as-a-Product
Self-Serve Data Platform
Federated Computational Governance
Design a Data Mesh on Azure
Create Data Products for the Domains
Create Self-Serve Data Platform
Federated Governance
Summary

Chapter 5: Data Orchestration Techniques
Structure
Objective
Data Orchestration Concepts
Modern Data Orchestration in Detail
Evolution of Data Orchestration
Data Integration
Middleware and ETL Tools
Enterprise Application Integration (EAI)
Service-Oriented Architecture (SOA)
Data Warehousing
Real-Time and Streaming Data Integration
Cloud-Based Data Integration
Data Integration for Big Data and NoSQL
Self-Service Data Integration
Data Pipelines
Data Processing using Data Pipelines
Benefits and Advantages of Data Pipelines
Common Use Cases for Data Pipelines
Data Governance Empowered by Data Orchestration: Enhancing Control and Compliance
Achieving Data Governance through Data Orchestration
Tools and Examples
Azure Data Factory
Azure Synapse
Summary

Chapter 6: Data Democratization, Governance, and Security
Objectives
Introduction to Data Democratization
Factors Driving Data Democratization
Layers of Democratization Architecture
Self-Service
Data Catalog and Data Sharing
People
Tools and Technology: Self-Service Tools
Data Governance Tools
Introduction to Data Governance
Ten Key Factors that Ensure Successful Data Governance
Data Stewardship
Models of Data Stewardship
Data Security Management
Security Layers
Data Security Approach
Types of Controls
Data Security in Outsourcing Mode
Popular Information Security Frameworks
Major Privacy and Security Regulations
Major Modern Security Management Concepts
Practical Use Case for Data Governance and Data Democratization
Problem Statement
High-Level Proposed Solution
Summary

Chapter 7: Business Intelligence
Structure
Objectives
Introduction to Business Intelligence
Descriptive Reports
Predictive Reports
Prescriptive Reports
Business Intelligence Tools
Query and Reporting Tools
Online Analytical Processing (OLAP) Tools
Analytical Applications
Trends in Business Intelligence (BI)
Business Decision Intelligence Analysis
Self-Service
Advanced BI Analytics
BI and Data Science Together
Data Strategy
Data and Analytics Approach and Strategy
Summary

Index
About the Authors
Anjani Kumar is the managing director and founder of
MultiCloud4u, a rapidly growing startup that helps clients
and partners seamlessly implement data-driven solutions
for their digital businesses. With a background in computer
science, Anjani began his career researching and developing
multi-lingual systems that were powered by distributed
processing and data synchronization across remote regions
of India. He later collaborated with companies such as
Mahindra Satyam, Microsoft, RBS, and Sapient to create data
warehouses and other data-based systems that could handle
high-volume data processing and transformation.

Dr. Abhishek Mishra is a cloud architect at a leading
organization and has more than a decade and a half of
experience building and architecting software solutions
for large and complex enterprises across the globe. He has
deep expertise in enabling digital transformations for his
customers using the cloud and artificial intelligence.


Sanjeev Kumar heads up a global data and analytics practice
at the leading and oldest multinational shoe company with
headquarters in Switzerland. He has 19+ years of experience
working for organizations in multiple industries modeling
modern data solutions. He has consulted with some of the
top multinational firms and enabled digital transformations
for large enterprises using modern data warehouses in
the cloud. He is an expert in multiple fields of modern
data management and execution, including data strategy,
automation, data governance, architecture, metadata,
modeling, business intelligence, data management, and
analytics.

About the Technical Reviewer
Viachaslau Matsukevich is an industry expert with
over a decade of experience in various roles, including
DevOps, cloud, solutions architecture, tech leadership, and
infrastructure engineering.
As a cloud solutions architect, Viachaslau has delivered
20+ DevOps projects for a number of Fortune 500 and
Global 2000 enterprises. He holds certifications from
Microsoft, Google, and the Linux Foundation, including
Solutions Architect Expert, Professional Cloud Architect, and
Kubernetes Administrator.
Viachaslau writes articles about cloud-native technologies and
Kubernetes for platforms such as Red Hat Enable Architect, SD Times, HackerNoon,
and DZone.
In addition to his technical expertise, Viachaslau serves as a technical reviewer for
technology books, ensuring the quality and accuracy of the latest publications.
He has also made significant contributions as an industry expert and judge
for esteemed awards programs, including SIIA CODiE Awards and Globee Awards
(including IT World Awards, Golden Bridge Awards, Disruptor Company Awards, and
American Best in Business Awards). Viachaslau has also lent his expertise as a judge in
over 20 hackathons.
Viachaslau is also the author of online courses covering a wide array of topics related
to cloud, DevOps and Kubernetes tools.
Follow Viachaslau on LinkedIn: https://www.linkedin.com/in/viachaslau-matsukevich/

Acknowledgments
We would like to thank Apress for giving us the opportunity to work on this book. Also,
thanks to the technical reviewer and the editor and the entire Apress team for supporting
us on this journey.

CHAPTER 1

Introduction
In the early days of computing, businesses struggled to keep up with the flood of data.
They had few options for storing and analyzing it, which hindered their ability to make
informed decisions. As technology improved, businesses recognized the value of data
and needed a way to make sense of it. This led to the birth of data warehousing, a term
coined by Bill Inmon in the 1980s. Inmon’s approach focused on structured, relational
data for reporting and analysis. Early data warehouses were basic but set the stage
for more advanced solutions as businesses gained access to more data. Today, newer
technologies such as Big Data platforms and data lakes have emerged to help deal with
the increasing volume and complexity of data. The data lakehouse combines the strengths
of data lakes and warehouses for real-time processing of both structured and unstructured
data, enabling advanced analytics and machine learning. While the chapters of this book
together cover all aspects of modern data warehousing, this chapter focuses on the
transformation of data warehousing techniques from past to present to future, and on how
that transformation shapes the building of a modern data warehouse.
In this chapter we will explore the following:

• History and Evolution of Data Warehouse
• Basic Concepts and Features of Data Warehouse
• Advantages and Examples of Cloud-based Data Warehouse
• Enterprise Scenario for Data Warehouse

© Anjani Kumar, Abhishek Mishra, and Sanjeev Kumar 2024
A. Kumar et al., Architecting a Modern Data Warehouse for Large Enterprises,
https://doi.org/10.1007/979-8-8688-0029-0_1
Chapter 1 Introduction

Objective
This chapter provides an overview of data warehouses and familiarizes the readers with
the terminologies and concepts of data warehouses. The chapter further focuses on the
transformation of data warehousing techniques from past to present to future, and how
it impacts building a modern data warehouse.
After studying this chapter, you should be able to do the following:

• Understand the basics of data warehousing, from the tools, processes,
and techniques used in modern-day data warehousing to the
different roles and responsibilities of a data warehouse team.
• Set up a synergy between engineering and operational communities,
even when they’re at different stages of learning and implementation
maturity.
• Determine what to adopt and what to ignore, ensuring your team
stays up to date with the latest trends in data warehousing.

Whether you’re starting a data warehouse team or just looking to expand your
knowledge, this guide is the perfect place to start. It will provide you with a background
on the topics covered in detail in further chapters, allowing you to better understand the
nuances of data warehousing and become an expert in the field.

Origin of Data Processing and Storage in the Computer Era
The history of data processing and storage dates back to the early 20th century
when mechanical calculators were used for basic arithmetic operations. However, it
wasn’t until the mid-20th century that electronic computers were developed, which
revolutionized data processing and storage.
The first general-purpose electronic computer, the ENIAC (Electronic Numerical Integrator and
Computer), was completed in 1945 by J. Presper Eckert and John Mauchly and unveiled in 1946. It was a massive
machine that filled an entire room and used vacuum tubes to perform calculations.
ENIAC was primarily used for military purposes, such as computing artillery
firing tables.


In the 1950s and ’60s, the development of smaller and faster transistors led to the
creation of smaller and more efficient computers. The introduction of magnetic tape and
magnetic disks in the 1950s allowed for the storage of large amounts of data, which
could be accessed much more quickly than with punched cards or paper tape.
In the 1970s, advances in integrated circuits (ICs), including the first microprocessors,
made it possible to create even smaller and more powerful computers. This led to the
spread of personal computers in the 1980s, which were affordable and accessible to a
wide range of users.
Today, data processing and storage are essential to nearly every aspect of modern
life, from scientific research to business and commerce to entertainment. The rapid
growth of modern storage solutions based on SSDs and flash memory, together with the
internet and cloud computing, has made it possible to store and access vast amounts of
data from almost anywhere in the world.
In conclusion, the origin of data processing and storage can be traced back to the
early 20th century, but it wasn’t until the development of electronic computers in the
mid-20th century that these processes became truly revolutionary. From massive room-
sized machines to powerful personal computers, data processing and storage have come
a long way and are now essential to almost every aspect of modern life.

Evolution of Databases and Codd Rules


The evolution of databases began with IBM’s development of the first commercially
successful database management system (DBMS) in the 1960s. The relational model of
databases, introduced by E.F. Codd in the 1970s, organized data into tables consisting
of rows and columns, leading to the development of Structured Query Language
(SQL). The rise of the internet and e-commerce from the 1990s onward led to the development of
NoSQL databases for handling vast amounts of unstructured and semi-structured data. The Chord
protocol, proposed in 2001, is a distributed hash table (DHT) algorithm that locates the node
responsible for a given key, a building block for partitioning data across the nodes of
distributed systems.
Codd’s 12 principles for relational databases established a framework for designing
and implementing a robust, flexible, and scalable data management system. These
principles are relevant in data warehousing today because they provide a standard for
evaluating data warehousing systems and ensuring that they can handle large volumes
of data, support complex queries, maintain data integrity, and evolve over time to meet
changing business needs.


Codd’s 12 rules for relational databases are as follows (Codd’s list also includes a
foundational Rule 0, which is not covered here):

1. Information Rule: All data in the database should be represented
as values in tables. This means that the database should be
structured as a collection of tables, with each table representing a
single entity or relationship.

2. Guaranteed Access Rule: Each value in the database should
be accessible by specifying its table name, primary key value,
and column name. This ensures that every piece of data in the
database is uniquely identifiable and can be accessed efficiently.

3. Systematic Treatment of Null Values: The database should
support the use of null values to represent missing or unknown
data. These null values should be treated consistently throughout
the system, with appropriate support for operations such as null
comparisons and null concatenations.

4. Dynamic Online Catalog Based on the Relational Model: The
database should provide a dynamic online catalog that describes
the structure of the database in terms of tables, columns,
indexes, and other relevant information. This catalog should be
accessible to users and applications and should be based on the
relational model.

5. Comprehensive Data Sublanguage Rule: The database should
support a comprehensive data sublanguage that allows users to
define, manipulate, and retrieve data in a variety of ways. This
sublanguage should be able to express complex queries, data
definitions, and data modifications.

6. View Updating Rule: The database should support the updating
of views, which are virtual tables that are defined in terms of other
tables. This allows users to modify data in a flexible and intuitive
way, without having to worry about the underlying structure of the
database.


7. High-Level Insert, Update, and Delete Rule: The database
should support high-level insert, update, and delete operations
that allow users to modify multiple rows or tables at once. This
simplifies data management and improves performance by
reducing the number of database interactions required.

8. Physical Data Independence: The database should be able to
store and retrieve data without being affected by changes to the
physical storage or indexing structure of the database. This allows
the database to evolve over time without requiring significant
changes to the application layer.

9. Logical Data Independence: The database should be able to
store and retrieve data without being affected by changes to the
logical structure of the database. This means that the database
schema can be modified without requiring changes to the
application layer.

10. Integrity Independence: Integrity constraints such as primary
keys, foreign keys, and other business rules should be defined in
the database itself and stored in its catalog rather than in
application programs. Constraints can then change without requiring
changes to the application layer, and data remains consistent and
accurate at all times.

11. Distribution Independence: Users and applications should be
unaffected by whether the data is stored in one location or
distributed across several. This allows the database to scale
horizontally and geographically without requiring changes to the
application layer.

12. Non-Subversion Rule: If the database offers a low-level,
record-at-a-time interface, that interface must not be usable to
bypass the integrity constraints and security rules expressed in
the relational language. In other words, no access path should let
users or applications subvert the rules that the database enforces.
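
Several of these rules can be observed directly in an ordinary SQL engine. The following is a minimal sketch, assuming Python’s built-in sqlite3 module and a hypothetical customer table (neither is taken from this chapter). It illustrates Rules 2, 3, and 4 only and should not be read as a compliance test for any product.

```python
import sqlite3

# Hypothetical in-memory database with one illustrative table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT NOT NULL, email TEXT)"
)
conn.executemany(
    "INSERT INTO customer (id, name, email) VALUES (?, ?, ?)",
    [(1, "Asha", "asha@example.com"), (2, "Ben", None)],
)

# Rule 2 (guaranteed access): table name + primary key value + column name
# is enough to reach exactly one value.
email = conn.execute("SELECT email FROM customer WHERE id = ?", (1,)).fetchone()[0]

# Rule 3 (systematic nulls): a missing email is stored as NULL and must be
# tested with IS NULL; comparing with '=' never matches a NULL.
missing = conn.execute("SELECT name FROM customer WHERE email IS NULL").fetchall()
equals_null = conn.execute("SELECT name FROM customer WHERE email = NULL").fetchall()

# Rule 4 (online catalog): the schema is itself queryable as a table.
catalog = conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table'"
).fetchall()
```

The catalog query uses the same SELECT syntax as the data queries, which is the point of Rule 4: metadata is reachable through the relational language itself.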


Traditional tabular systems based on Codd’s rules remain relevant, but with the rise
of the internet and e-commerce, there was a huge increase in the volume and
variety of data being generated. To handle this data, new NoSQL databases were
developed, which are more flexible and scalable, especially for unstructured data.
In building a widely accepted data warehouse, it’s important to consider the
strengths and weaknesses of both traditional and NoSQL databases and to follow
best practices for data quality, data modeling, data governance, and security.
In the upcoming sections of this chapter, we will explore this transition
step by step while giving special attention to the areas that remain
relevant for creating a strong and widely accepted modern data warehouse.

Transitioning to the World of Data Warehouses


In the 1970s, the dominant form of database used in business was the hierarchical
database, which organized data in a tree-like structure, with parent and child nodes.
However, as businesses began to collect more data and as the need for complex
querying and reporting increased, it became clear that the hierarchical database was not
sufficient.
This led to the development of the network database, which allowed for more
complex relationships between data, but it was still limited in its ability to handle large
volumes of data and complex querying. As a result, the relational database model was
developed, which organized data into tables consisting of rows and columns, allowing
for more efficient storage and easier retrieval of information.
However, the relational model was not without its limitations. As businesses
continued to collect more data, the need for a centralized repository to store and manage
data became increasingly important. This led to the development of the data warehouse,
which is a large, centralized repository of data that is optimized for reporting and
analysis.
The data warehouse is designed to handle large volumes of data from multiple
sources and to provide a single source of truth for reporting and analytics. Data
warehouses use specialized technologies, such as extract, transform, load (ETL)
processes, to extract data from multiple sources, transform it into a common format, and
load it into the data warehouse.


Data warehouses also use specialized tools for querying and reporting, such as
online analytical processing (OLAP), which allows users to analyze data across multiple
dimensions, and data mining, which uses statistical and machine learning techniques to
identify patterns and relationships in the data.

The transition from operational databases to data warehousing began in the 1980s, as
businesses realized the limitations of existing database
models when handling large volumes of data and complex querying. The
development of the data warehouse provided a centralized repository for storing
and managing data, as well as specialized tools for reporting and analysis. Today,
data warehouses are a critical component of modern businesses, enabling them to
make data-driven decisions and stay competitive in a rapidly changing market.

During this pivotal transition in the world of data management, numerous scientists
and experts made significant contributions to the field. Notable among them are
Bill Inmon, revered as the originator of the data warehouse concept, which focuses
on a single source of truth for reporting and analysis; Ralph Kimball, a renowned
data warehousing expert who introduced dimensional modeling, which emphasizes
optimized data modeling for reporting, star schemas, and fact tables; and Dan Linstedt,
who invented the data vault modeling approach, which combines elements of Inmon’s
and Kimball’s methodologies and is tailored for handling substantial data volumes
and historical reporting. In addition, Claudia Imhoff, a business intelligence and data
warehousing expert, founded the Boulder BI Brain Trust, offering thought leadership;
Barry Devlin pioneered the business data warehouse concept, which highlights business
metadata’s importance and aligns data warehousing with business objectives; and, lastly,
Jim Gray, a computer scientist and database researcher, contributed significantly by
introducing the data cube, a multidimensional database structure for enhanced analysis
and reporting. In conclusion, these luminaries represent just a fraction of the visionary
minds that shaped modern data warehousing, empowering businesses to harness data
for informed decision-making in a dynamic market landscape.


Data Warehouse Concepts


In today’s world, businesses collect more data than ever before. This data can come from
a variety of sources, such as customer transactions, social media, and Internet of Things
(IoT) devices. However, collecting data is only the first step; to truly unlock the value
of this data, businesses must be able to analyze and report on it. This is where the data
warehouse comes in. The following are aspects of the data warehouse:
• A data warehouse is a large, centralized repository of data that is
optimized for reporting and analysis. The data warehouse is designed
to handle large volumes of data from multiple sources, and to provide a
single source of truth for reporting and analytics. It is a critical component
of modern business intelligence, enabling businesses to make data-
driven decisions and stay competitive in a rapidly changing market.
• Data warehouses use specialized technologies, such as extract,
transform, load (ETL) processes, to extract data from multiple
sources, transform it into a common format, and load it into the
data warehouse. This allows businesses to bring together data from
disparate sources and create a single, unified view of the data.
• Data warehouses also use specialized tools for querying and
reporting, such as online analytical processing (OLAP), which allows
users to analyze data across multiple dimensions, and data mining,
which uses statistical and machine learning techniques to identify
patterns and relationships in the data.
• One of the key features of the data warehouse is its ability to handle
historical data. Traditional transactional databases are optimized for
handling current data, but they are not well suited to handling large
volumes of historical data. Data warehouses, however, are optimized
for handling large volumes of historical data, which is critical for
trend analysis and forecasting.

• In addition, data warehouses are designed to be easy to use for
business users. They use specialized reporting tools that allow users
to create custom reports and dashboards, and to drill down into the
data to gain deeper insights. This makes it easy for business users to
access and analyze the data they need to make informed decisions.


There are several common concepts in data warehouses that are essential to
understanding their architecture. Here are some of the most important concepts:

• Data Sources: A data warehouse collects data from a variety of
sources, such as transactional databases, external data sources, and
flat files. Data is extracted from these sources and transformed into a
standardized format before being loaded into the data warehouse.

• ETL (Extract, Transform, Load): This is the process used to collect
data from various sources and prepare it for analysis in the data
warehouse. During this process, data is extracted from the source
systems, transformed into a common format, and loaded into the
data warehouse.

• Data Marts: A data mart is a subset of a data warehouse that is
designed to meet the needs of a particular department or group
within an organization. Data marts are typically organized around
specific business processes or functions, such as sales or marketing.

• Data Modeling: In the field of data warehousing, there are two main
approaches to modeling data: tabular modeling and dimensional
modeling. Tabular modeling is a relational approach to data
modeling, which means it organizes data into tables with rows and
columns. Dimensional modeling involves organizing data around
dimensions (such as time, product, or location) and measures (such
as sales revenue or customer count) and using a star or snowflake
schema to represent the data.

• OLAP (Online Analytical Processing): OLAP is a set of tools and
techniques used to analyze data in a data warehouse. OLAP tools
allow users to slice and dice data along different dimensions and to
drill down into the data to gain deeper insights.

• Data Mining: Data mining is the process of analyzing large datasets
to identify patterns, trends, and relationships in the data. This
technique uses statistical and machine learning algorithms to
discover insights and make predictions based on the data.


• Metadata: Metadata is data about the data in a data warehouse. It
provides information about the source, structure, and meaning of the
data in the warehouse, and is essential for ensuring that the data is
accurate and meaningful.
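
The dimensional modeling and OLAP concepts in this list can be sketched in a few lines. The example below assumes Python’s built-in sqlite3 module as a stand-in for a warehouse engine; the table names (dim_product, dim_date, fact_sales) and the sample figures are hypothetical and chosen only for illustration. It builds a tiny star schema and then aggregates the revenue measure along two dimensions.

```python
import sqlite3

dw = sqlite3.connect(":memory:")
dw.executescript(
    """
    -- Dimension tables describe the "by what" of an analysis.
    CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, category TEXT);
    CREATE TABLE dim_date    (date_id INTEGER PRIMARY KEY, year INTEGER, quarter TEXT);
    -- The fact table holds the measures, keyed by the dimensions (star schema).
    CREATE TABLE fact_sales (
        product_id INTEGER REFERENCES dim_product(product_id),
        date_id    INTEGER REFERENCES dim_date(date_id),
        revenue    REAL
    );
    INSERT INTO dim_product VALUES (1, 'Books'), (2, 'Games');
    INSERT INTO dim_date    VALUES (10, 2023, 'Q4'), (11, 2024, 'Q1');
    INSERT INTO fact_sales  VALUES (1, 10, 100.0), (1, 11, 50.0), (2, 11, 25.0);
    """
)

# "Slice and dice": aggregate the revenue measure by category and year.
by_category_year = dw.execute(
    """
    SELECT p.category, d.year, SUM(f.revenue)
    FROM fact_sales f
    JOIN dim_product p ON p.product_id = f.product_id
    JOIN dim_date d    ON d.date_id = f.date_id
    GROUP BY p.category, d.year
    ORDER BY p.category, d.year
    """
).fetchall()
```

Grouping by quarter instead of year, or adding a WHERE clause on a single category, would correspond to the drill-down and slice operations that OLAP tools expose through a graphical interface.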

Data Sources (Data Format and Common Sources)


In a data warehouse, data source refers to any system or application that provides data
to the data warehouse. A data source can be any type of system or application that
generates data, such as a transactional system, a customer relationship management
(CRM) application, or an enterprise resource planning (ERP) system.
The data from these sources is extracted and transformed before it is loaded into
the data warehouse. This process involves cleaning, standardizing, and consolidating
the data to ensure that it is accurate, consistent, and reliable. Once the data has been
transformed, it is then loaded into the data warehouse for storage and analysis.
In some cases, data sources may be connected to the data warehouse using extract,
transform, and load (ETL) processes, while in other cases, they may be connected
using other data integration methods, such as data replication, data federation, or data
virtualization.

Note: Data sources are a critical component of a data warehouse, as they
provide the data that is needed to support business intelligence and analytics. By
consolidating data from multiple sources into a single location, a data warehouse
enables organizations to gain insights into their business operations and make
more-informed decisions.

There are various types and formats of data sources that can be used in a data
warehouse. Here are some examples:

• Relational databases: A common data source for a data warehouse
is a relational database, such as Oracle, Microsoft SQL Server, or
MySQL. These databases store data in tables with defined schemas
and can be queried using SQL.


• Flat files: Data can also be sourced from files, such as CSV or other
delimited text files, Excel workbooks, or Parquet files. Delimited files store
rows as lines and columns as separated fields, whereas Parquet is a binary,
column-oriented format.

• Cloud storage services: Cloud storage services, such as Amazon
S3 or Azure Data Lake Storage, can also be used as a data source for
a data warehouse. These services can store data in a structured or
unstructured format and can be accessed through APIs.

• NoSQL databases: NoSQL databases, such as MongoDB or
Cassandra, can be used as data sources for data warehouses. These
databases are designed to handle large volumes of unstructured or
semi-structured data and are queried through their own query languages
and APIs, such as Cassandra’s CQL.

• Real-time data sources: Real-time data sources, such as message
queues or event streams, can be used to stream data into a data
warehouse in real-time. This type of data source is often used for
applications that require up-to-date data.

• APIs: APIs can also be used as a data source, providing access to data
from third-party applications or web services.

The format of the data coming from multiple sources also varies. Data can be
structured (such as relational tables), semi-structured (such as JSON or XML), or
unstructured (such as free text or images). The data format needs to be considered
when designing the data warehouse schema and the ETL processes. It is important to
ensure that the data is properly transformed and loaded into the data warehouse in a
format that is usable for analysis.
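
To make the point about formats concrete, the following is a minimal sketch of turning one semi-structured JSON document into flat, table-like rows. It assumes Python’s built-in json module; the order document, its field names, and the flatten_order helper are hypothetical and serve only as an illustration.

```python
import json

# Hypothetical semi-structured source record: one order with nested line items.
raw_event = (
    '{"order_id": 17, "customer": {"id": 4, "country": "DE"}, '
    '"items": [{"sku": "A1", "qty": 2}, {"sku": "B7", "qty": 1}]}'
)


def flatten_order(document: str) -> list[dict]:
    """Turn one nested order document into one flat row per line item."""
    order = json.loads(document)
    return [
        {
            "order_id": order["order_id"],
            "customer_id": order["customer"]["id"],
            "country": order["customer"]["country"],
            "sku": item["sku"],
            "qty": item["qty"],
        }
        for item in order["items"]
    ]


rows = flatten_order(raw_event)
```

Each resulting row has the same fixed set of columns, which is the shape a warehouse table expects. Deciding which nested fields become columns is part of the schema design mentioned above.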
Data can flow to the data warehouse through different systems, the most common of
which include the following:

• Transactional databases: Transactional databases are typically
the primary source of data for a data warehouse. These databases
capture and store business data generated by various systems, such
as sales, finance, and operations.

• ERP systems: Enterprise resource planning (ERP) systems are used
by many organizations to manage their business processes. ERP
systems can provide a wealth of data that can be used in a data
warehouse, including information on customer orders, inventory,
and financial transactions.


• CRM systems: Customer relationship management (CRM) systems
provide data on customer interactions that can be used to support
business analytics and decision-making.

• Legacy systems: Legacy systems are often used to store important
historical data that needs to be incorporated into the data warehouse.
This data may be stored in a variety of formats, including flat files or
proprietary databases.

• Cloud-based systems: Cloud-based systems, such as
software-as-a-service (SaaS) applications, are becoming increasingly popular as
data sources for data warehouses. These systems can provide access
to a variety of data, including customer behavior, website traffic, and
sales data.

• Social media: Social media platforms are another source of data
that can be used in a data warehouse. This data can be used to gain
insights into customer behavior, sentiment analysis, and brand
reputation.

One effective approach for documenting data-related artifacts, such as data sources
and data flows, is using data dictionaries and data catalogs. These tools can capture
relevant information about data elements, including their structure and meaning, as well
as provide more comprehensive details about data sources, flows, lineage, and ownership.
By leveraging these tools, implementation teams and data operations teams can gain a
better understanding of this information, leading to improved data quality, consistency,
and collaboration across various teams and departments within an organization.

Note: When categorizing data into structured or unstructured sources, you’ll
find that older systems, such as transactional, ERP, CRM, and legacy systems, tend to
have better-organized and better-classified data than cloud-based
systems or social media. It’s not accurate to say that all data from
cloud platforms and website analytics is unstructured, but analyzing
such data often requires additional computing power to derive significant insights.
With the adoption of cloud computing, organizations are increasingly storing
unstructured data.


ETL (Extract, Transform, Load)


ETL stands for extract, transform, load. It is a process used to move data from one or
more source systems, transform the data to fit business needs, and load the data into a
target system, such as a data warehouse.
The ETL process is an essential component of a data warehouse, as it enables
organizations to consolidate and integrate data from multiple sources into a single,
unified view of their business operations. Here is a brief overview of the ETL process:

• Extract: The first step in the ETL process is to extract the data from
the source systems. This can be done using various methods, such as
APIs, file transfers, or direct database connections.

• Transform: Once the data has been extracted, it needs to be
transformed to fit the needs of the data warehouse. This may involve
cleaning the data, consolidating duplicate records, converting data
types, or applying business rules and calculations.

• Load: After the data has been transformed, it is loaded into the target
system, such as a data warehouse. This can be done using various
methods, such as bulk inserts, incremental updates, or real-time
streaming.
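
The three steps above can be sketched as three small functions. This is a minimal illustration, assuming Python’s built-in csv and sqlite3 modules, an in-memory SQLite database standing in for the warehouse, and a hypothetical orders extract; the function names and sample data are not taken from any particular ETL tool.

```python
import csv
import io
import sqlite3

# Hypothetical source extract; in practice this would come from a file,
# an API, or a direct database connection.
SOURCE_CSV = """order_id,customer,amount,order_date
1, alice ,10.50,2024-01-03
2,BOB,7.25,2024-01-03
2,BOB,7.25,2024-01-03
3,carol,,2024-01-04
"""


def extract(text):
    """Extract: read raw rows from the source as dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: drop duplicate and incomplete rows, normalize names, cast types."""
    seen, clean = set(), []
    for row in rows:
        if not row["amount"] or row["order_id"] in seen:
            continue
        seen.add(row["order_id"])
        clean.append(
            (
                int(row["order_id"]),
                row["customer"].strip().title(),
                float(row["amount"]),
                row["order_date"],
            )
        )
    return clean


def load(conn, rows):
    """Load: bulk-insert the cleaned rows into the target table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS orders "
        "(order_id INTEGER PRIMARY KEY, customer TEXT, amount REAL, order_date TEXT)"
    )
    conn.executemany("INSERT OR REPLACE INTO orders VALUES (?, ?, ?, ?)", rows)
    conn.commit()


warehouse = sqlite3.connect(":memory:")
load(warehouse, transform(extract(SOURCE_CSV)))
loaded = warehouse.execute("SELECT * FROM orders ORDER BY order_id").fetchall()
```

Keeping each step in its own function mirrors how ETL tools separate connectors, transformations, and sinks, and it lets each step be tested on its own. The INSERT OR REPLACE statement is one simple way to make reloading the same extract safe; production pipelines typically use more explicit incremental-load logic.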

The ETL process can be complex and time-consuming, particularly for large datasets
or complex data models. However, modern ETL tools and technologies, such as cloud-
based data integration platforms, have made the process more efficient and scalable.

A well-designed ETL process is critical to the success of a data warehouse, as it
ensures that the data is accurate, consistent, and reliable. By providing a unified
view of business data, a data warehouse enables organizations to gain insights
into their operations, identify trends and patterns, and make more informed
decisions.


There are many ETL (extract, transform, load) software tools available, both
commercial and open source. Here are some examples:

• Informatica PowerCenter: Informatica PowerCenter is a
popular ETL tool that offers a wide range of data integration and
transformation features, including data profiling, data quality, and
metadata management.

• Microsoft SQL Server Integration Services (SSIS): SSIS is a
powerful ETL tool that is part of the Microsoft SQL Server suite.
It provides a wide range of data integration and transformation
features, including data cleansing, data aggregation, and data
enrichment.

• Talend Open Studio: Talend Open Studio is an open source ETL
tool that offers a broad range of data integration and transformation
features, including support for Big Data platforms like Hadoop
and Spark.

• IBM InfoSphere DataStage: IBM InfoSphere DataStage is a
comprehensive ETL tool that offers advanced data integration
and transformation features, including support for real-time data
processing and complex data structures.

• Oracle Data Integrator (ODI): ODI is a powerful ETL tool that
offers a broad range of data integration and transformation features,
including support for Big Data and cloud platforms.

• Apache NiFi: Apache NiFi is an open-source data integration and
transformation tool that provides a flexible, web-based interface for
designing and executing data workflows. It supports a wide range
of data sources and destinations and can be used for real-time data
processing and streaming.

• Azure Data Factory: Azure Data Factory is a cloud-based data
integration service offered by Microsoft Azure. It allows you to
create, schedule, and manage data integration pipelines. It provides
90+ built-in connectors for seamless data integration from various
sources, including on-premises data stores. Azure Data Factory
enables easy design, deployment, and monitoring of data integration
pipelines through an intuitive graphical interface or code. This helps
you manage your data more efficiently, reduce operational costs, and
accelerate business insights.

• AWS Glue: AWS Glue is a serverless ETL service by Amazon Web
Services that automates time-consuming ETL tasks for preparing
data for analytics, machine learning, and application development.
It enables you to create data transformation workflows that can
extract, transform, and load data from various sources into data lakes,
warehouses, and other stores. You can use pre-built transformations
or custom code with Python or Scala for ETL. AWS Glue is based on
Apache Spark, allowing for fast and scalable data processing, and
integrates with other AWS services like Amazon S3, Amazon Redshift,
and Amazon RDS. This service simplifies the ETL process and frees
up time for analyzing data to make informed business decisions.

These are just a few examples of the many ETL tools available for data integration and
transformation. The choice of ETL tool depends on the specific needs and requirements of
the organization, as well as the available resources and budget.

ETL and ELT


ETL (extract, transform, load) and ELT (extract, load, transform) are both data
integration techniques that are used to transfer data from source systems to target
systems. The main difference between ETL and ELT lies in the order of the data
processing steps:

• In ETL, data is first extracted from source systems, then transformed
into the desired format, and finally loaded into the target system. This
means that the transformation step takes place outside of the target
system and can involve complex data manipulation and cleansing.

• In ELT, data is first extracted from source systems and loaded into
the target system, and then transformed within the target system.
This means that the transformation step takes place within the target
system, using its processing power and capabilities to transform
the data.


• The choice between ETL and ELT depends on several factors,
including the complexity and size of the data being processed, the
capabilities of the target system, and the processing and storage
costs. ETL is more suitable for complex data transformation, while
ELT is more suitable for large data volumes and systems with
advanced processing capabilities.
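
The difference in ordering is easiest to see in code. The sketch below shows the ELT variant, assuming Python’s built-in sqlite3 module with an in-memory database playing the role of the target system; the raw_orders and orders tables and the sample rows are hypothetical. The raw data is landed unchanged first, and the cleaning is then expressed in SQL that runs inside the target.

```python
import sqlite3

# Hypothetical rows as they arrive from the source: untyped and untidy.
raw_rows = [
    ("1", " alice ", "10.50"),
    ("2", "BOB", "7.25"),
    ("3", "carol", ""),
]

target = sqlite3.connect(":memory:")

# Load: land the data unchanged in a raw staging table (everything as text).
target.execute("CREATE TABLE raw_orders (order_id TEXT, customer TEXT, amount TEXT)")
target.executemany("INSERT INTO raw_orders VALUES (?, ?, ?)", raw_rows)

# Transform: the target engine does the cleaning, using its own SQL.
target.execute(
    """
    CREATE TABLE orders AS
    SELECT CAST(order_id AS INTEGER) AS order_id,
           LOWER(TRIM(customer))     AS customer,
           CAST(amount AS REAL)      AS amount
    FROM raw_orders
    WHERE amount <> ''
    """
)

raw_count = target.execute("SELECT COUNT(*) FROM raw_orders").fetchone()[0]
clean = target.execute("SELECT * FROM orders ORDER BY order_id").fetchall()
```

In an ETL pipeline, the same trimming and casting would happen in the integration tool before anything reaches the target. With ELT, the raw table stays available, so the transformation can be changed and rerun without extracting from the source again.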

Data Mart
A data mart is a subset of a larger data warehouse and is designed to serve the needs of a
particular business unit or department. Data marts are used to provide targeted, specific
information to end users, allowing them to make better, data-driven decisions.
A data mart is typically designed to store data that is relevant to a specific business
area or function, such as sales, marketing, or finance. Data marts can be created using
data from the larger data warehouse, or they can be created as standalone systems that
are populated with data from various sources.

Data Mart Architecture


The architecture of a data mart can vary depending on the specific needs of the business
unit or department it serves. However, some common elements are typically found in a
data mart architecture:

• Data sources: The data sources for a data mart can come from
various systems and applications, such as transactional systems,
operational databases, or other data warehouses.

• ETL process: The ETL process is used to extract data from the source
systems, transform it to meet the needs of the data mart, and load it
into the data mart.

• Data mart database: The data mart database is the repository
that stores the data for the specific business unit or department. It
is typically designed to be optimized for the types of queries and
analyses that are performed by the end users. (In modern architectures,
this may in some cases be a temporary, transformed data store that is
refreshed periodically and keeps no history.)


• Business intelligence tools: Business intelligence (BI) tools are used
to analyze the data in the data mart and provide reports, dashboards,
and other visualizations to end users.
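
As a rough illustration of how these elements fit together, the sketch below derives a small sales data mart from a warehouse table. It assumes Python’s built-in sqlite3 module; the orders table, its business_unit column, and the sales_mart table are hypothetical names used only for this example.

```python
import sqlite3

warehouse = sqlite3.connect(":memory:")
warehouse.executescript(
    """
    -- Hypothetical warehouse table shared by all business units.
    CREATE TABLE orders (order_id INTEGER, business_unit TEXT, region TEXT, amount REAL);
    INSERT INTO orders VALUES
        (1, 'sales', 'EMEA', 120.0),
        (2, 'sales', 'APAC', 80.0),
        (3, 'marketing', 'EMEA', 40.0),
        (4, 'sales', 'EMEA', 30.0);

    -- The mart keeps only what the sales team queries: revenue by region.
    CREATE TABLE sales_mart AS
        SELECT region, COUNT(*) AS order_count, SUM(amount) AS revenue
        FROM orders
        WHERE business_unit = 'sales'
        GROUP BY region;
    """
)

mart = warehouse.execute("SELECT * FROM sales_mart ORDER BY region").fetchall()
```

Here the mart is a pre-aggregated subset that a BI tool could query directly. Dropping and rebuilding sales_mart on a schedule would correspond to the periodically refreshed, history-free data store mentioned in the list above.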

Advantages of Data Marts


Data marts are a crucial component of modern data management and analytics
strategies, offering several advantages that organizations can leverage to drive informed
decision-making and enhance their competitive edge. These streamlined subsets of data
warehouses are designed to cater to specific business units or departments, providing a
focused and efficient approach to data access and analysis. Some of the key advantages
of a data mart, with examples, are as follows:
• Targeted data: Data marts provide a subset of the larger data
warehouse that is specifically designed to meet the needs of a
particular business unit or department. This makes it easier for end
users to find the data they need and make well-informed decisions.
• Improved performance: Data marts are designed to be optimized
for the types of queries and analyses that are performed by the end
users. This can improve query performance and reduce the time it
takes to access and analyze data.
• Reduced complexity: By focusing on a specific business area or
function, data marts can simplify the data architecture and make it
easier to manage and maintain.

Examples of Data Marts


An organization comprises various departments, including Sales, Marketing, Finance,
etc., each with distinct analytics needs and specific information consumption
requirements. To address these diverse needs effectively, separate data marts
are essential.
• Sales data mart: A sales data mart might be used to provide
information on customer orders, product sales, and revenue by
region or by salesperson.

• Marketing data mart: A marketing data mart might be used
to provide information on customer demographics, campaign
performance, and customer acquisition costs.
• Finance data mart: A finance data mart might be used to provide
information on budgeting, forecasting, and financial reporting.
In conclusion, data marts are an essential component of a data warehouse, providing
targeted, specific information to end users and enabling them to make better, data-driven
decisions. By designing data marts to meet the specific needs of individual business units
or departments, organizations can improve performance, reduce complexity, and achieve
their business objectives more effectively.

Data Modeling
In the field of data warehousing, there are two main approaches to modeling data:
tabular modeling and dimensional modeling. Both approaches have their strengths
and weaknesses, and choosing the right one for your specific needs is crucial to building
an effective data warehouse.

Tabular Modeling
Tabular modeling is a relational approach to data modeling, which means it organizes
data into tables with rows and columns. This approach is well suited to handling large
volumes of transactional data and is often used in OLTP (online transaction processing)
systems. In a tabular model, data is organized into a normalized schema, where each
entity is stored in its own table, and the relationships between the tables are established
through primary and foreign keys.
The advantages of tabular modeling include its simplicity, ease of use, and flexibility.
Because data is organized into a normalized schema, it is easier to add or modify data
fields, and it supports complex queries and reporting. However, tabular models can
become more complex to query and maintain as the number of tables and relationships
increases, and it can be slower to process queries on large datasets.

Dimensional Modeling
Dimensional modeling is a more specialized approach that is optimized for OLAP (online
analytical processing) systems. Dimensional models organize data into a star or snowflake
schema, with a fact table at the center and several dimension tables surrounding it.
The fact table contains the measures (i.e., numerical data) that are being analyzed, while
the dimension tables provide the context (i.e., descriptive data) for the measures.
Also, dimensional modeling is optimized for query performance, making it well
suited for OLAP and especially reporting systems. Because data is organized into a star
or snowflake schema, it is easier to perform aggregations and analyses, and it is faster to
query large datasets. Dimensional models are also easier to understand and maintain,
making them more accessible to business users. However, dimensional models can
be less flexible and more complex to set up, and they may not perform as well with
transactional data.
In conclusion, both tabular and dimensional modeling have their places in data
warehousing, and the choice between them depends on the specific needs of your
organization. Tabular modeling is more suited to handling large volumes of transactional
data, while dimensional modeling is optimized for OLAP systems and faster query
performance.

In modern warehousing with data lakes and delta lakes, tabular models structured
into facts and dimensions are still effective. Multiple tools are available to
strike a balance between extreme normalization and extreme denormalization. While
tabular models provide a simpler structure and facilitate querying of data,
dimensional models make the data more readily usable for analytics and reporting.

Understanding Dimensional Modeling in Brief


In data warehousing, dimensions, facts, and measures are essential concepts that are
used to organize and analyze data. A dimension is a category of data that provides
context for a fact, while a fact is a value that describes a specific event or activity.
Measures are numerical values that quantify facts.
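
To make these three concepts concrete, the following sketch builds a tiny star schema in an in-memory SQLite database and aggregates it. The table names (`dim_product`, `dim_date`, `fact_sales`), the columns, and the sample rows are assumptions chosen for illustration only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Dimension tables: descriptive context for the facts.
CREATE TABLE dim_product (
    product_key INTEGER PRIMARY KEY, name TEXT, category TEXT
);
CREATE TABLE dim_date (
    date_key INTEGER PRIMARY KEY, year INTEGER, month INTEGER
);
-- Fact table: foreign keys to the dimensions plus numeric values.
CREATE TABLE fact_sales (
    product_key INTEGER REFERENCES dim_product(product_key),
    date_key INTEGER REFERENCES dim_date(date_key),
    units_sold INTEGER,
    revenue REAL
);
INSERT INTO dim_product VALUES
    (1, 'Laptop', 'Electronics'), (2, 'Desk', 'Furniture');
INSERT INTO dim_date VALUES (20240101, 2024, 1), (20240201, 2024, 2);
INSERT INTO fact_sales VALUES
    (1, 20240101, 2, 2000.0),
    (1, 20240201, 1, 1000.0),
    (2, 20240101, 3, 600.0);
""")

# Measures (SUM of units and revenue) grouped by dimension attributes.
rows = conn.execute("""
    SELECT p.category, d.month, SUM(f.units_sold), SUM(f.revenue)
    FROM fact_sales AS f
    JOIN dim_product AS p ON p.product_key = f.product_key
    JOIN dim_date AS d ON d.date_key = f.date_key
    GROUP BY p.category, d.month
    ORDER BY p.category, d.month
""").fetchall()

for row in rows:
    print(row)
```

Here the dimension attributes (`category`, `month`) appear in the `GROUP BY` clause, while the measures are the aggregated numeric columns of the fact table.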

Dimensions

A dimension is a grouping or category of data that provides context for a fact.
Dimensions can be thought of as the “who, what, when, where, and why” of a dataset.
For example, a sales dataset might include dimensions such as date, product,
customer, and location. Each of these dimensions provides additional information about
the sales data and helps to contextualize it.

Dimensions can be further classified into the following types:

• Degenerate Dimension: A degenerate dimension is a dimension
that is not stored in a separate dimension table but is included in the
fact table. A typical example is an order or invoice number that is
kept directly on each sales fact row.

• Conformed Dimension: A conformed dimension is a dimension
that is used in multiple fact tables in the same data warehouse. It is
designed to ensure the consistency and accuracy of data across the
different fact tables. For example, let’s consider a retail company
that sells products through multiple channels, such as brick and
mortar stores, online stores, and mobile apps. The company has a
data warehouse that stores data about sales, inventory, and customer
behavior.

In this scenario, the “customer” dimension is a good example of a
conformed dimension. The customer dimension contains attributes
such as customer name, address, age, gender, and purchase history.
This dimension is used in multiple fact tables, such as sales fact table,
customer behavior fact table, and inventory fact table.

By using a conformed dimension for customer data, the data
warehouse ensures that all the information related to customers is
consistent and accurate across all the fact tables. It also simplifies the
data model and reduces the risk of data inconsistencies and errors.

Another advantage of using conformed dimensions is that they can
be reused across multiple data marts or data domains. This means
that the same customer dimension can be used for sales analysis,
customer behavior analysis, and inventory management analysis
without duplicating the data or creating a separate dimension for
each fact table.

• Junk Dimension: A junk dimension is a collection of flags and
indicators that are not related to any specific dimension. These flags
and indicators are grouped together into a single dimension table to
simplify the data model and improve query performance.

Junk dimensions are used when you have many low-cardinality flags
that are not related to any specific dimension, and it’s not worthwhile
to create a separate dimension for each flag.

The name junk comes from the fact that the dimension contains
seemingly unrelated attributes that don’t fit neatly into any other
dimension. Examples of attributes that can be included in a junk
dimension are as follows:

• Boolean indicators: “yes” or “no” flags that describe the
presence or absence of a particular condition

• Flags: “on” or “off” indicators that specify the status of a
particular process or workflow

• Codes: short codes that describe the result of a particular event or
transaction

• Dates: dates or timestamps that indicate when a particular event
occurred

By consolidating these attributes into a single dimension table,
you can simplify the data model and improve query performance.
The fact table references the junk dimension table through a single
foreign key, in the same way as it references any other dimension,
instead of carrying each flag as a separate column.

For example, let’s consider an e-commerce website that sells
products online. The website has a fact table that records the sales
transactions and several dimensions, such as product, customer,
and time. The fact table contains several low-cardinality flags,
such as “shipped,” “cancelled,” “returned,” and “gift-wrapped,”
which are not related to any specific dimension. Instead of
creating a separate dimension table for each flag, these flags can
be consolidated into a junk dimension table. The junk dimension
table will have a record for each unique combination of these
flags, and the fact table will reference the junk dimension table
using a foreign key.

Junk dimensions can be an effective way to simplify a data
model and reduce the number of dimension tables required in
a data warehouse. However, care should be taken to ensure that
the attributes in the junk dimension are truly unrelated and do
not belong in any other dimension. Otherwise, the use of a junk
dimension can lead to data-quality issues and analysis errors.

• Role-Playing Dimension: A role-playing dimension is a dimension
that is used in different ways in the same fact table. For example, a
date dimension can be used to analyze sales by order date, ship date,
or delivery date. Role-playing dimensions are useful when the same
dimension table is used in different contexts with different meanings.
By reusing the same dimension table, the data model can be
simplified, and the data can be more easily analyzed and understood.
However, it’s important to ensure that the meaning of each use of the
dimension table is clearly defined to avoid confusion and errors in
data analysis.

• Slowly Changing Dimension (SCD): A slowly changing dimension
is a dimension whose attribute values change slowly and irregularly
over time. SCDs are commonly classified into numbered types; the
following six are described here:

• Type 1 SCD: In a Type 1 SCD, the old values are simply
overwritten with new values when a change occurs. This
approach is suitable for dimensions where historical information
is not required.

Suppose you have a price master table that contains
information about products such as name, price, and details.
If the price of a product changes, you might simply update the
price in the price master table without keeping track of the
historical price.

• Type 2 SCD: In a Type 2 SCD, a new row is added to the
dimension table when a change occurs, and the old row is
marked as inactive. This approach is suitable for dimensions
where historical information is required.

Continuing with the price master table example, if you want
to keep track of the historical price of each product, you might
create a new row for each price change. For example, you
might add a new row with a new product version number and
a new price whenever the price changes. This way, you can
keep track of the historical prices of each product.

• Type 3 SCD: In a Type 3 SCD, a limited amount of historical
information is maintained by adding columns to the dimension
table to store previous values. This approach is suitable
for dimensions where only a limited amount of historical
information is required.

Suppose you have an employee table that contains
information about employees such as name, address, and
salary. If an employee gets a promotion, you might add a new
column to the table to store the previous job title, while the
existing column is updated with the new one. You would only
store the current and the immediately preceding job title, and
not keep track of the full history of job titles.

• Type 4 SCD: Create a separate table to store historical data. This
type of SCD is useful when historical data needs to be stored
separately for performance reasons.

Suppose you have a customer table that contains information
about customers such as name, address, and phone number.
If you want to keep track of historical addresses, you might
create a new table to store the historical addresses. The new
table would contain the customer ID, the old address, and the
date the address was changed.

• Type 5 SCD: Combine SCD Types 1 and 4 (4 + 1 = 5) by keeping
the current value in the main dimension table, where it is
overwritten on each change, and recording the history in a
separate table. This type of SCD can be useful when there are a
large number of historical changes, but only the current value is
needed for most queries.

Continuing with the customer table example, if you want
to keep track of the current and historical phone numbers
for each customer, you might create a new column in the
customer table to store the current phone number, and then
create a new row in a separate phone number table for each
phone number change. The phone number table would
contain the customer ID, the phone number, the start date,
and the end date.

• Type 6 SCD: Combine SCD Types 1, 2, and 3 (1 + 2 + 3 = 6) by
adding a new row for each change and also keeping additional
columns that track the current and previous values. This
type of SCD is useful when historical data is important, but only
the current and previous values are needed for most queries.

Suppose you have a product table that contains information
about products such as name, price, and description. If the
price of a product changes, you might create a new row for the
new product version and store the new price in that row. You
might also add a new column to the product table to store the
previous price. This way, you can easily access the current and
previous prices of each product.

• Time Dimension: A time dimension is a special type of dimension
that is used to track time-related data. It provides a way to group and
filter data based on time periods such as hours, days, weeks, months,
and years.

• Hierarchical Dimension: A hierarchical dimension is a dimension
that has a parent–child relationship between its attributes. For
example, a product dimension can have a hierarchy that includes
product category, sub-category, and product.

• Virtual Dimension: A virtual dimension is a dimension that is
created on the fly during query execution. It is not stored in the data
warehouse and is only used for a specific analysis or report.
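
Of the dimension types listed above, the Type 2 slowly changing dimension is the one whose mechanics are easiest to see in code. The sketch below is one possible way to implement it, assuming a hypothetical `dim_product` table with a surrogate key, validity dates, and an `is_current` flag; none of these names come from a specific product.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE dim_product (
        product_key INTEGER PRIMARY KEY AUTOINCREMENT,  -- surrogate key
        product_id  TEXT,     -- business (natural) key
        price       REAL,
        valid_from  TEXT,
        valid_to    TEXT,     -- NULL while the row is current
        is_current  INTEGER
    )
""")


def apply_price_change(conn, product_id, new_price, change_date):
    """Type 2 update: expire the current row, then insert a new version."""
    current = conn.execute(
        "SELECT product_key, price FROM dim_product "
        "WHERE product_id = ? AND is_current = 1",
        (product_id,),
    ).fetchone()
    if current is not None:
        if current[1] == new_price:
            return  # nothing changed, keep the existing version
        # Mark the old row as inactive instead of overwriting it.
        conn.execute(
            "UPDATE dim_product SET valid_to = ?, is_current = 0 "
            "WHERE product_key = ?",
            (change_date, current[0]),
        )
    conn.execute(
        "INSERT INTO dim_product "
        "(product_id, price, valid_from, valid_to, is_current) "
        "VALUES (?, ?, ?, NULL, 1)",
        (product_id, new_price, change_date),
    )


apply_price_change(conn, "P-100", 9.99, "2024-01-01")
apply_price_change(conn, "P-100", 12.49, "2024-06-01")

history = conn.execute(
    "SELECT price, valid_from, valid_to, is_current FROM dim_product "
    "WHERE product_id = 'P-100' ORDER BY product_key"
).fetchall()
for row in history:
    print(row)
```

A Type 1 implementation would replace the body of `apply_price_change` with a single `UPDATE` of the `price` column, which is simpler but discards the earlier price.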

Facts

A fact is a value that describes a specific event or activity. Facts are typically numeric and
can be aggregated to provide insight into a dataset. For example, a sales dataset might
include facts such as the total sales revenue, the number of units sold, or the average
sales price.

Facts are associated with dimensions through the fact table, which contains the
measurements (Measures) for each event or activity. The fact table typically contains
foreign keys to link the fact table to the dimension tables.
Facts are the numerical data that we want to analyze. They are the values that we
measure and aggregate in order to gain insights into our data. In a sales data warehouse,
the facts could include sales revenue, quantity sold, and profit.

Each fact has a corresponding measure that defines the unit of measurement for
the fact. For example, revenue could be measured in dollars, while quantity sold
could be measured in units.

The fact table contains the numerical values, and it is linked to the dimension tables
through foreign key relationships. The fact table is typically narrow and deep: it has
relatively few columns (keys and measures) but far more rows than the dimension tables.
Best practices for designing facts include the following:

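• Choose appropriate measures: It is important to choose measures
that are meaningful and appropriate for the business.

• Normalize the data: Normalizing the data in the fact table can help
to reduce redundancy and improve performance.

• Use additive measures: Additive measures, such as revenue or units
sold, can be summed across all dimensions. Semi-additive measures,
such as account balances, can be summed across some dimensions but
not others (typically not time), and non-additive measures, such as
ratios and percentages, cannot be meaningfully summed across any
dimension.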
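
The practical consequence of additivity can be shown with a small calculation. In this sketch the sample figures and the `margin` helper are hypothetical; the point is only that an additive value such as revenue can be summed directly, whereas a ratio such as profit margin has to be recomputed from its additive components.

```python
# Hypothetical fact rows: revenue and cost are additive values.
sales = [
    {"region": "North", "revenue": 1000.0, "cost": 600.0},
    {"region": "South", "revenue": 200.0, "cost": 180.0},
]


def margin(revenue, cost):
    """Profit margin as a fraction of revenue (a non-additive ratio)."""
    return (revenue - cost) / revenue


# Additive values can simply be summed across rows.
total_revenue = sum(row["revenue"] for row in sales)
total_cost = sum(row["cost"] for row in sales)

# Correct overall margin: recompute the ratio from the summed components.
correct_margin = margin(total_revenue, total_cost)

# Misleading result: averaging the per-row ratios ignores their weights.
naive_margin = sum(margin(r["revenue"], r["cost"]) for r in sales) / len(sales)

print(total_revenue, round(correct_margin, 2), round(naive_margin, 2))
```

This is one reason to store the additive components in the fact table and derive ratios from them at query time or in the reporting layer.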

Measures

Measures are the values that we use to aggregate and analyze the facts. Measures are the
result of applying mathematical functions to the numerical data in the fact table.
For example, measures could include average sales, total sales, and maximum sales.
Measures can be simple or complex, and they can be derived from one or more facts.
Measures can be pre-calculated and stored in the fact table, or they can be calculated on
the fly when the user queries the data warehouse.

Best practices for designing measures include the following:

• Choose appropriate functions: It is important to choose functions
that are appropriate for the business and the data being analyzed.

• Use consistent units of measurement: Measures should use
consistent units of measurement to avoid confusion.

• Avoid calculations in the query: Pre-calculate measures that are
frequently used to improve performance.

The dimensional modeling technique provides a powerful method for designing
data warehouses. The key concepts of dimensions, facts, and measures are
essential to the design of a successful data warehouse. By following best practices
for designing dimensions, facts, and measures, you can create a data warehouse
that is easy to use, efficient, and provides meaningful insights into your data.

Schematics Facts and Dimension Structuring


A key aspect of a data warehouse is its schema, which defines the structure of the data in
the warehouse; it’s the way to structure the facts and dimension tables or objects.
There are several different types of schemas that can be used in a data warehouse, as
follows:

• Star Schema: In this schema, the fact table is surrounded by
dimension tables, which contain the attributes of the fact data. This
schema is easy to understand and query, making it a popular choice
for data warehouses.

• Snowflake Schema: This schema is like the star schema, but it
normalizes the dimension tables, which reduces redundancy in the
data. However, this makes the schema more complex and harder
to query.

• Fact Constellation Schema: This schema, also called a galaxy
schema, contains multiple fact tables that share common dimension
tables. It can be viewed as a collection of star schemas and is used
to model complex, heterogeneous data.
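
The difference between the star and snowflake schemas shows up mainly in the number of joins a query needs. The following sketch, with hypothetical table names and sample data, models the same product dimension both ways in an in-memory SQLite database and checks that the two queries return the same totals.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Star: category is stored directly on the product dimension.
CREATE TABLE star_dim_product (
    product_key INTEGER PRIMARY KEY, name TEXT, category TEXT
);
-- Snowflake: category is normalized into its own table.
CREATE TABLE snow_dim_category (
    category_key INTEGER PRIMARY KEY, category TEXT
);
CREATE TABLE snow_dim_product (
    product_key INTEGER PRIMARY KEY, name TEXT,
    category_key INTEGER REFERENCES snow_dim_category(category_key)
);
CREATE TABLE fact_sales (product_key INTEGER, revenue REAL);

INSERT INTO star_dim_product VALUES
    (1, 'Laptop', 'Electronics'), (2, 'Phone', 'Electronics'),
    (3, 'Desk', 'Furniture');
INSERT INTO snow_dim_category VALUES (10, 'Electronics'), (20, 'Furniture');
INSERT INTO snow_dim_product VALUES
    (1, 'Laptop', 10), (2, 'Phone', 10), (3, 'Desk', 20);
INSERT INTO fact_sales VALUES (1, 2000.0), (2, 1000.0), (3, 600.0);
""")

# Star schema: one join from the fact table to the dimension.
star = conn.execute("""
    SELECT p.category, SUM(f.revenue)
    FROM fact_sales AS f
    JOIN star_dim_product AS p ON p.product_key = f.product_key
    GROUP BY p.category ORDER BY p.category
""").fetchall()

# Snowflake schema: an extra join to reach the normalized category.
snow = conn.execute("""
    SELECT c.category, SUM(f.revenue)
    FROM fact_sales AS f
    JOIN snow_dim_product AS p ON p.product_key = f.product_key
    JOIN snow_dim_category AS c ON c.category_key = p.category_key
    GROUP BY c.category ORDER BY c.category
""").fetchall()

print(star)
print(snow)
```

In the star version the category name is repeated on every product row, while the snowflake version stores it once at the cost of a second join.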

26
Random documents with unrelated
content Scribd suggests to you:
with . As will be seen, the latter condition will be satisfied for all
the electrons in the atoms of elements of low atomic weight and for a
greater part of the electrons contained in the atoms of the other
elements.
If the velocity of the electrons is not small compared with the
velocity of light, the constancy of the angular momentum no longer
involves a constant ratio between the energy and the frequency of
revolution. Without introducing new assumptions, we cannot therefore
in this case determine the configuration of the systems on the basis
of the considerations in Part I. Considerations given later suggest,
however, that the constancy of the angular momentum is the principal
condition. Applying this condition for velocities not small compared
with the velocity of light, we get the same expression for as that
given by (1), while the quantity in the expressions for and is
replaced by and in the expression for by

As stated in Part I., a calculation based on the ordinary mechanics


gives the result, that a ring of electrons rotating round a positive
nucleus in general is unstable for displacements of the electrons in
the plane of the ring. In order to escape from this difficulty, we have
assumed that the ordinary principles of mechanics cannot be used in
the discussion of the problem in question, any more than in the
discussion of the connected problem of the mechanism of binding of
electrons. We have also assumed that the stability for such
displacements is secured through the introduction of the hypothesis
of the universal constancy of the angular momentum of the electrons.
As is easily shown, the latter assumption is included in the
condition of stability in §1. Consider a ring of electrons rotating round
a nucleus, and assume that the system is in dynamical equilibrium
and that the radius of the ring is , the velocity of the electrons
the total kinetic energy , and the potential energy . As shown in
Part I. (p. 21) we have . Next consider a configuration of
the system in which the electrons, under influence of extraneous
forces, rotate with the same angular momentum round the nucleus in
a ring of radius . In this case we have , and on

account of the uniformity of the angular momentum and

. Using the relation , we get

We see that the total energy of the new configuration is greater than
in the original. According to the condition of stability in §1 the system
is consequently stable for the displacement considered. In this
connexion, it may be remarked that in Part I. we have assumed that
the frequency of radiation emitted or absorbed by the systems cannot
be determined from the frequencies of vibration of the electrons, in
the plane of the orbits, calculated by help of the ordinary mechanics.
We have, on the contrary, assumed that the frequency of the radiation
is determined by the condition , where is the frequency,
Planck’s constant, and the difference in energy corresponding to
two different “stationary” states of the system.
In considering the stability of a ring of electrons rotating round a
nucleus for displacements of the electrons perpendicular to the plane
of the ring, imagine a configuration of the system in which the
electrons are displaced by , ,.... , respectively, and
suppose that the electrons, under influence of extraneous forces,
rotate in circular orbits parallel to the original plane with the same
radii and the same angular momentum round the axis of the system
as before. The kinetic energy is unaltered by the displacement, and
neglecting powers of the quantities , .... , higher than the
second, the increase of the potential energy of the system is given by
where is the radius of the ring, the charge on the nucleus, and
the number of electrons. According to the condition of stability in §1
the system is stable for the displacements considered, if the above
expression is positive for arbitrary values of ,.... . By a simple
calculation it can be shown that the latter condition is equivalent to
the condition

where denotes the whole number (smaller than ) for which

has its smallest value. This condition is identical with the condition of
stability for displacements of the electrons perpendicular to the plane
of the ring, deduced by help of ordinary mechanical
considerations[28].
A suggestive illustration is obtained by imagining that the
displacements considered are produced by the effect of extraneous
forces acting on the electrons in a direction parallel to the axis of the
ring. If the displacements are produced infinitely slowly the motion of
the electrons will at any moment be parallel to the original plane of
the ring, and the angular momentum of each of the electrons round
the centre of its orbit will obviously be equal to its original value; the
increase in the potential energy of the system will be equal to the
work done by the extraneous forces during the displacements. From
such considerations we are led to assume that the ordinary
mechanics can be used in calculating the vibrations of the electrons
perpendicular to the plane of the ring—contrary to the case of
vibrations in the plane of the ring. This assumption is supported by
the apparent agreement with observations obtained by Nicholson in
his theory of the origin of lines in the spectra of the solar corona and
stellar nebulæ (see Part I. pp. 6 & 23). In addition it will be shown
later that the assumption seems to be in agreement with experiments
on dispersion.
The following table gives the values of and from
to .
, , ; , ,
1 0 0 9 3.328 13.14
2 0.25 0.25 10 3.863 18.13
3 0.577 0.58 11 4.416 23.60
4 0.957 1.41 12 4.984 30.82
5 1.377 2.43 13 5.565 38.57
6 1.828 4.25 14 6.159 48.38
7 2.305 6.35 15 6.764 58.83
8 2.805 9.56 16 7.379 71.85
We see from the table that the number of electrons which can
rotate in a single ring round a nucleus of charge increases only
very slowly for increasing ; for the maximum value is
; for , ; for , . We see, further,
that a ring of electrons cannot rotate in a single ring round a
nucleus of charge ne unless .
In the above we have supposed that the electrons move under the
influence of a stationary radial force and that their orbits are exactly
circular. The first condition will not be satisfied if we consider a
system containing several rings of electrons which rotate with
different frequencies. If, however, the distance between the rings is
not small in comparison with their radii, and if the ratio between their
frequencies is not near to unity, the deviation from circular orbits may
be very small and the motion of the electrons to a close
approximation may be identical with that obtained on the assumption
that the charge on the electrons is uniformly distributed along the
circumference of the rings. If the ratio between the radii of the rings is
not near to unity, the conditions of stability obtained on this
assumption may also be considered as sufficient.
We have assumed in §1 that the electrons in the atoms rotate in
coaxial rings. The calculation indicates that only in the case of
systems containing a great number of electrons will the planes of the
rings separate; in the case of systems containing a moderate number
of electrons, all the rings will be situated in a single plane through the
nucleus. For the sake of brevity, we shall therefore here only consider
the latter case.
Let us consider an electric charge uniformly distributed along
the circumference of a circle of radius .
At a point distant from the plane of the ring, and at a distance
from the axis of the ring, the electrostatic potential is given by

Putting in this expression and , and using the


notation

we get for the radial force exerted on an electron in a point in the


plane of the ring

where
The corresponding force perpendicular to the plane of the ring at a
distance from the centre of the ring and at a small distance from
its plane is given by

where

A short table of the functions and is given on p. 35.


Next consider a system consisting of a number of concentric rings
of electrons which rotate in the same plane round a nucleus of charge
. Let the radii of the rings be , ,...., and the number of
electrons on the different rings , ,....

Putting , we get for the radial force acting on an

electron in the th ring where

the summation is to be taken over all the rings except the one
considered.
If we know the distribution of the electrons in the different rings,
from the relation (1) on p. 28, we can, by help of the above,
determine , , .... The calculation can be made by successive
approximations, starting from a set of values for the ’s, and from
them calculating the ’s, and then redetermining the ’s by the
relation (1) which gives , and so on.

As in the case of a single ring it is supposed that the systems are


stable for displacements of the electrons in the plane of their orbits. In
a calculation such as that on p. 30., the interaction of the rings ought
strictly to be taken into account. This interaction will involve that the
quantities are not constant, as for a single ring rotating round a
nucleus, but will vary with the radii of the rings; the variation in ,
however, if the ratio between the radii of the rings is not very near to
unity, will be too small to be of influence on the result of the
calculation.
Considering the stability of the systems for a displacement of the
electrons perpendicular to the plane of the rings, it is necessary to
distinguish between displacements in which the centres of gravity of
the electrons in the single rings are unaltered, and displacements in
which all the electrons inside the same ring are displaced in the same
direction. The condition of stability for the first kind of displacements
is given by the condition (5) on p. 31., if for every ring we replace

by a quantity , determined by the condition that is equal to


the component perpendicular to the plane of the ring of the force—
due to the nucleus and the electrons in the other rings—acting on one
of the electrons if it has received a small displacement . Using the
same notation as above, we get

If all the electrons in one of the rings are displaced in the same
direction by help of extraneous forces, the displacement will produce
corresponding displacements of the electrons in the other rings; and
this interaction will be of influence on the stability. For example,
consider a system of concentric rings rotating in a plane round a
nucleus of charge , and let us assume that the electrons in the
different rings are displaced perpendicular to the plane by , ,....
respectively. With the above notation the increase in the
potential energy of the system is given by
The condition of stability is that this expression is positive for arbitrary
values of ,.... . This condition can be worked out simply in the
usual way. It is not of sensible influence compared with the condition
of stability for the displacements considered above, except in cases
where the system contains several rings of few electrons.
The following Table, containing the values of and for
every fifth degree from to , gives an estimate of
the order of magnitude of these functions:—

20 0.132 0.001 0.002


25 0.011
30 0.333 0.021 0.048
35 0.490 0.080 0.217
40 0.704 0.373 1.549
45 1.000 ......... .........
50 1.420 1.708 4.438
55 2.040 1.233 1.839
60 3.000 1.093 1.301
65 4.599 1.037 1.115
70 7.548 1.013 1.041

indicates the ratio between the radii of the rings


. The values of show that unless the ratio of
the radii of the rings is nearly unity the effect of outer rings on the
dimensions of inner rings is very small, and that the corresponding
effect of inner rings on outer is to neutralize approximately the effect
of a part of the charge on the nucleus corresponding to the number of
electrons on the ring. The values of show that the effect of
outer rings on the stability of inner—though greater than the effect on
the dimensions—is small, but that unless the ratio between the radii
is very great, the effect of inner rings on the stability of outer is
considerably greater than to neutralize a corresponding part of the
charge of the nucleus.
The maximum number of electrons which the innermost ring can
contain without being unstable is approximately equal to that
calculated on p. 32. for a single ring rotating round a nucleus. For the
outer rings, however, we get considerably smaller numbers than
those determined by the condition (5) if we replace by the total
charge on the nucleus and on the electrons of inner rings.
If a system of rings rotating round a nucleus in a single-plane is
stable for small displacements of the electrons perpendicular to this
plane, there will in general be no stable configurations of the rings,
satisfying the condition of the constancy of the angular momentum of
the electrons, in which all the rings are not situated in the plane. An
exception occurs in the special case of two rings containing equal
numbers of electrons; in this case there may be a stable configuration
in which the two rings have equal radii and rotate in parallel planes at
equal distances from the nucleus, the electrons in the one ring being
situated just opposite the intervals between the electrons in the other
ring. The latter configuration, however, is unstable if the configuration
in which all the electrons in the two rings are arranged in a single ring
is stable.

§3. Constitution of Atoms containing very few Electrons.

As stated in §1, the condition of the universal constancy of the


angular momentum of the electrons, together with the condition of
stability, is in most cases not sufficient to determine completely the
constitution of the system. On the general view of formation of atoms,
however, and by making use of the knowledge of the properties of the
corresponding elements, it will be attempted, in this section and the
next, to obtain indications of what configurations of the electrons may
be expected to occur in the atoms. In these considerations we shall
assume that the number of electrons in the atom is equal to the
number which indicates the position of the corresponding element in
the series of elements arranged in order of increasing atomic weight.
Exceptions to this rule will be supposed to occur only at such places
in the series where deviations from the periodic law of the chemical
properties of the elements are observed. In order to show clearly the
principles used we shall first consider with some detail those atoms
containing very few electrons.
For sake of brevity we shall, by the symbol $N(n_1, n_2, \ldots)$, refer
to a plane system of rings of electrons rotating round a nucleus of
charge $Ne$ satisfying the condition of the angular momentum of the
electrons with the approximation used in §2. $n_1$, $n_2$, ... are the
numbers of electrons in the rings, starting from inside. By $a_1$, $a_2$, ...
and $\omega_1$, $\omega_2$, ... we shall denote the radii and frequencies of the rings
taken in the same order. The total amount of energy $W$ emitted by
the formation of the system shall simply be denoted by
$W[N(n_1, n_2, \ldots)]$.

N=1. Hydrogen.
In Part I. we have considered the binding of an electron by a
positive nucleus of charge $e$, and have shown that it is possible to
account for the Balmer spectrum of hydrogen on the assumption of
the existence of a series of stationary states in which the angular
momentum of the electron round the nucleus is equal to entire
multiples of the value $h/2\pi$, where $h$ is Planck’s constant. The formula
found for the frequencies of the spectrum was

$$\nu = \frac{2\pi^2 e^4 m}{h^3}\left(\frac{1}{\tau_2^2} - \frac{1}{\tau_1^2}\right),$$

where $\tau_1$ and $\tau_2$ are entire numbers. Introducing the values for $e$, $m$,
and $h$ used on p. 29, we get for the factor before the bracket
$3.1 \cdot 10^{15}$ [29]; the value observed for the constant in the Balmer
spectrum is $3.290 \cdot 10^{15}$.
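The comparison between the calculated and the observed constant can be checked numerically. The sketch below takes the frequency formula to be $\nu = K(1/\tau_2^2 - 1/\tau_1^2)$ with $K = 2\pi^2 e^4 m / h^3$, and assumes the 1913-era constants $e = 4.7 \cdot 10^{-10}$, $e/m = 5.31 \cdot 10^{17}$ and $h = 6.5 \cdot 10^{-27}$ (C.G.S. units); these constants and the function names are illustrative assumptions, not values taken from this passage.

```python
import math

# Assumed 1913-era constants in C.G.S. / e.s.u. units (not modern values).
E = 4.7e-10          # charge of the electron
E_OVER_M = 5.31e17   # charge-to-mass ratio of the electron
H = 6.5e-27          # Planck's constant

def balmer_factor(e=E, e_over_m=E_OVER_M, h=H):
    """Factor K = 2*pi^2 * e^4 * m / h^3 standing before the bracket."""
    m = e / e_over_m
    return 2.0 * math.pi ** 2 * e ** 4 * m / h ** 3

def frequency(tau2, tau1, factor):
    """Frequency emitted in the passage from state tau1 to state tau2."""
    return factor * (1.0 / tau2 ** 2 - 1.0 / tau1 ** 2)

K_THEORY = balmer_factor()
K_OBSERVED = 3.290e15  # observed constant of the Balmer spectrum, 1/sec

# First lines of the Balmer series (tau2 = 2) using the observed constant.
for tau1 in (3, 4, 5):
    print(tau1, frequency(2, tau1, K_OBSERVED))

# Ratio of calculated to observed constant.
print(K_THEORY / K_OBSERVED)
```

With these assumed constants the calculated factor falls a few per cent below the observed one, which is within the uncertainty of the constants themselves.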
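For the permanent state of a neutral hydrogen atom we get from
the formulae (1) and (2) in §2, putting $F = 1$,

$$1(1): \quad a = \frac{h^2}{4\pi^2 e^2 m} = 0.55 \cdot 10^{-8}, \quad \omega = \frac{4\pi^2 e^4 m}{h^3} = 6.2 \cdot 10^{15}, \quad W = \frac{2\pi^2 e^4 m}{h^2} = 2.0 \cdot 10^{-11}.$$

These values are of the order of magnitude to be expected. For
$W/e$ we get $0.043$, which corresponds to $13$ volt; the value for the
ionizing potential of a hydrogen atom, calculated by Sir J. J. Thomson
from experiments on positive rays, is $11$ volt [30]. No other definite
data, however, are available for hydrogen atoms. For sake of brevity,
we shall in the following denote the values for $a$, $\omega$, and $W$
corresponding to the configuration $1(1)$ by $a_0$, $\omega_0$ and $W_0$.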
At distances from the nucleus, great in comparison with $a_0$, the
system $1(1)$ will not exert sensible forces on free electrons. Since,
however, the configuration

$$1(2): \quad a = 1.33\, a_0, \quad \omega = 0.563\, \omega_0, \quad W = 1.13\, W_0$$

corresponds to a greater value for $W$ than the configuration $1(1)$, we
may expect that a hydrogen atom under certain conditions can
acquire a negative charge. This is in agreement with experiments on
positive rays. Since $W[1(3)]$ is only $0.54\, W_0$, a hydrogen atom cannot be
expected to be able to acquire a double negative charge.
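Energies of single-ring configurations such as those compared here can be estimated with a few lines of code. The sketch below assumes that the §2 relations take the form $F = N - s_n$ with $s_n = \frac{1}{4}\sum_{k=1}^{n-1} 1/\sin(k\pi/n)$, and $a = a_0/F$, $\omega = F^2 \omega_0$, $W = nF^2 W_0$, where $a_0$, $\omega_0$, $W_0$ stand for the radius, frequency and energy of the neutral hydrogen atom. That form is an assumption here, since §2 lies outside this passage, and the function names are illustrative.

```python
import math

def s(n):
    """Mutual-repulsion term s_n = (1/4) * sum_{k=1}^{n-1} 1/sin(k*pi/n)."""
    return 0.25 * sum(1.0 / math.sin(k * math.pi / n) for k in range(1, n))

def single_ring(N, n):
    """Radius, frequency and emitted energy of n electrons in one ring
    round a nucleus of charge N*e, in units of a0, omega0 and W0."""
    F = N - s(n)
    return 1.0 / F, F * F, n * F * F

# Hydrogen nucleus (N = 1) binding one, two and three electrons.
for n in (1, 2, 3):
    a, omega, W = single_ring(1, n)
    print(f"1({n}): a = {a:.3f} a0, omega = {omega:.3f} omega0, W = {W:.2f} W0")
```

In this approximation a second electron raises $W$ above the one-electron value, while a third lowers it below that value, in line with the argument about single and double negative charges.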

N=2. Helium.
As shown in Part I., using the same assumptions as for hydrogen,
we must expect that during the binding of an electron by a nucleus of
charge $2e$ a spectrum is emitted, expressed by

$$\nu = \frac{2\pi^2 e^4 m}{h^3}\left(\frac{1}{(\tau_2/2)^2} - \frac{1}{(\tau_1/2)^2}\right).$$

This spectrum includes the spectrum observed by Pickering in the
star ζ Puppis and the spectra recently observed by Fowler in
experiments with vacuum tubes filled with a mixture of hydrogen and
helium. These spectra are generally ascribed to hydrogen.
For the permanent state of a positively charged helium atom, we
get

$$2(1): \quad a = \tfrac{1}{2}\, a_0, \quad \omega = 4\, \omega_0, \quad W = 4\, W_0.$$
At distances from the nucleus great compared with the radius of
the bound electron, the system $2(1)$ will, to a close approximation, act
on an electron as a simple nucleus of charge $e$. For a system
consisting of two electrons and a nucleus of charge $2e$, we may
therefore assume the existence of a series of stationary states in
which the electron most lightly bound moves approximately in the
same way as the electron in the stationary states of a hydrogen atom.
Such an assumption has already been used in Part I. in an attempt to
explain the appearance of Rydberg’s constant in the formula for the
line-spectrum of any element. We can, however, hardly assume the
existence of a stable configuration in which the two electrons have
the same angular momentum round the nucleus and move in different
orbits, the one outside the other. In such a configuration the electrons
would be so near to each other that the deviations from circular orbits
would be very great. For the permanent state of a neutral helium
atom, we shall therefore adopt the configuration

$$2(2): \quad a = 0.571\, a_0, \quad \omega = 3.06\, \omega_0, \quad W = 6.13\, W_0.$$
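Since

$$W[2(2)] - W[2(1)] = 2.13\, W_0,$$

we see that both electrons in a neutral helium atom are more firmly
bound than the electron in a hydrogen atom. Using the values on p.
38, we get

$$2.13\, \frac{W_0}{e} = 27 \text{ volts} \quad \text{and} \quad 2.13\, \frac{W_0}{h} = 6.6 \cdot 10^{15}\ \frac{1}{\text{sec}};$$

these values are of the same order of magnitude as the value
observed for the ionization potential in helium, $20.5$ volt [31],
and the value for the frequency of the ultra-violet absorption in helium
determined by experiments on dispersion, $5.9 \cdot 10^{15}\ \frac{1}{\text{sec}}$ [32].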
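The conversion of a binding energy expressed in units of the hydrogen energy into a potential in volts and into a frequency can be sketched as follows. The constants used ($W_0 = 2.0 \cdot 10^{-11}$ erg for the hydrogen energy, $e = 4.7 \cdot 10^{-10}$ e.s.u., $h = 6.5 \cdot 10^{-27}$ erg sec, and one electrostatic unit of potential taken as about 299.79 volts) are assumed period values, and the function names are hypothetical.

```python
# Assumed constants in C.G.S. / e.s.u. units (1913-era values, not modern ones).
W0 = 2.0e-11               # energy of the neutral hydrogen atom, erg
E = 4.7e-10                # charge of the electron, e.s.u.
H = 6.5e-27                # Planck's constant, erg sec
STATVOLT_IN_VOLTS = 299.79 # one e.s.u. of potential expressed in volts

def binding_in_volts(w_units):
    """Potential (volts) corresponding to an energy of w_units * W0."""
    return w_units * W0 / E * STATVOLT_IN_VOLTS

def binding_as_frequency(w_units):
    """Frequency (1/sec) corresponding to an energy of w_units * W0."""
    return w_units * W0 / H

# Hydrogen (1 W0) and the second helium electron (2.13 W0).
for w_units in (1.0, 2.13):
    print(w_units, binding_in_volts(w_units), binding_as_frequency(w_units))
```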

The frequency in question may be regarded as corresponding to
vibrations in the plane of the ring (see p. 30). The frequency of
vibration of the whole ring perpendicular to the plane, calculated in
the ordinary way (see p. 32), is given by $\nu = 3.27\, \omega_0$. The fact that
the latter frequency is great compared with that observed might
explain that the number of electrons in a helium atom, calculated by
help of Drude’s theory from the experiments on dispersion, is only
about two-thirds of the number to be expected. (Using
$e/m = 5.31 \cdot 10^{17}$ the value calculated is $1.2$.)

For a configuration of a helium nucleus and three electrons, we
get

$$2(3): \quad a = 0.703\, a_0, \quad \omega = 2.02\, \omega_0, \quad W = 6.07\, W_0.$$

Since $W$ for this configuration is smaller than for the configuration
$2(2)$, the theory indicates that a helium atom cannot acquire a
negative charge. This is in agreement with experimental evidence,
which shows that helium atoms have no “affinity” for free
electrons [33].
In a later paper it will be shown that the theory offers a simple
explanation of the marked difference in the tendency of hydrogen and
helium atoms to combine into molecules.

N=3. Lithium.
In analogy with the cases of hydrogen and helium we must expect
that during the binding of an electron by a nucleus of charge $3e$, a
spectrum is emitted, given by

$$\nu = \frac{2\pi^2 e^4 m}{h^3}\left(\frac{1}{(\tau_2/3)^2} - \frac{1}{(\tau_1/3)^2}\right).$$

On account of the great energy to be spent in removing all the
electrons bound in a lithium atom (see below) the spectrum
considered can only be expected to be observed in extraordinary
cases.
In a recent note Nicholson [34] has drawn attention to the fact that
in the spectra of certain stars, which show the Pickering spectrum
with special brightness, some lines occur the frequencies of which to
a close approximation can be expressed by the formula

$$\nu = K\left(\frac{1}{4} - \frac{1}{\left(m \pm \frac{1}{3}\right)^2}\right),$$

where $K$ is the same constant as in the Balmer spectrum of
hydrogen. From analogy with the Balmer- and Pickering-spectra,
Nicholson has suggested that the lines in question are due to
hydrogen.
It is seen that the lines discussed by Nicholson are given by the
above formula if we put $\tau_2 = 6$. The lines in question correspond to
$\tau_1 = 10$, $13$, and $14$; if we for $\tau_2 = 6$ put $\tau_1 = 9$, $12$, and $15$, we
get lines coinciding with lines of the ordinary Balmer-spectrum of
hydrogen. If we in the above formula put $\tau_2 = 1$, $2$, and $3$, we get
series of lines in the ultra-violet. If we put $\tau_2 = 4$ we get only a single
line in the visible spectrum, viz.: for $\tau_1 = 5$, which gives
$\nu = 6.662 \cdot 10^{14}$, or a wave-length $\lambda = 4503 \cdot 10^{-8}$ cm., closely
coinciding with the wave-length $4504 \cdot 10^{-8}$ cm. of one of the lines
of unknown origin in the table quoted by Nicholson. In this table,
however, no lines occur corresponding to $\tau_2 = 5$.
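The coincidences described in this paragraph can be verified with exact fractions. The sketch below assumes that the spectrum of a nucleus of charge $3e$ has the form $\nu = K(9/\tau_2^2 - 9/\tau_1^2)$, with $\tau_1$, $\tau_2$ whole numbers and $K$ the Balmer constant, and that the Balmer lines are $K(1/4 - 1/n^2)$. The rounded speed of light and the function names are assumptions made for the illustration.

```python
from fractions import Fraction

K = 3.290e15   # observed Balmer constant, 1/sec
C = 3.0e10     # speed of light in cm/sec, rounded (assumption)

def lithium_term(tau2, tau1):
    """Bracket 9/tau2^2 - 9/tau1^2 for a nucleus of charge 3e, in units of K."""
    return Fraction(9, tau2 ** 2) - Fraction(9, tau1 ** 2)

def balmer_term(n):
    """Bracket 1/4 - 1/n^2 of the ordinary Balmer series, in units of K."""
    return Fraction(1, 4) - Fraction(1, n ** 2)

# tau2 = 6 with tau1 = 9, 12, 15 should reproduce Balmer lines n = 3, 4, 5.
coincident = {tau1: lithium_term(6, tau1) == balmer_term(tau1 // 3)
              for tau1 in (9, 12, 15)}

# The single visible line for tau2 = 4, tau1 = 5.
nu = K * float(lithium_term(4, 5))
wavelength_cm = C / nu

print(coincident)
print(nu, wavelength_cm)
```

Because $9/6^2 = 1/4$ and $9/(3n)^2 = 1/n^2$, every third line of the $\tau_2 = 6$ series necessarily falls on a Balmer line; the remaining values of $\tau_1$ give the lines of the form $m \pm \frac{1}{3}$.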
For the permanent state of a lithium atom with two positive
charges we get a configuration

$$3(1): \quad a = \tfrac{1}{3}\, a_0, \quad \omega = 9\, \omega_0, \quad W = 9\, W_0.$$

A permanent configuration in which two
electrons move in different orbits around each other must for lithium
be considered still less probable than for helium, as the ratio between
the radii of the orbits would be still nearer to unity. For a lithium atom
with a single positive charge we shall, therefore, adopt the
configuration:

$$3(2): \quad a = 0.364\, a_0, \quad \omega = 7.56\, \omega_0, \quad W = 15.13\, W_0.$$

Since $W[3(2)] - W[3(1)] = 6.13\, W_0$, we see that the first two
electrons in a lithium atom are very strongly bound compared with the
electron in a hydrogen atom; they are still more rigidly bound than the
electrons in a helium atom.
From a consideration of the chemical properties we should expect
the following configuration for the electrons in a neutral lithium atom:

$$3(2,1): \quad a_1 = 0.362\, a_0, \quad \omega_1 = 7.65\, \omega_0; \qquad a_2 = 1.182\, a_0, \quad \omega_2 = 0.716\, \omega_0; \qquad W = 16.02\, W_0.$$

This configuration may be considered as highly probable also
from a dynamical point of view. The deviation of the outermost
electron from a circular orbit will be very small, partly on account of
the great values of the ratio between the radii, and of the ratio
between the frequencies of the orbits of the inner and outer electrons,
partly also on account of the symmetrical arrangement of the inner
electrons. Accordingly, it appears probable that the three electrons
will not arrange themselves in a single ring and form the system:

$$3(3): \quad a = 0.413\, a_0, \quad \omega = 5.87\, \omega_0, \quad W = 17.61\, W_0,$$

although $W$ for this configuration is greater than for $3(2,1)$.


Since $W[3(2,1)] - W[3(2)] = 0.89\, W_0$, we see that the outer
electron in the configuration $3(2,1)$ is bound even more lightly than the
electron in a hydrogen atom. The difference in the firmness of the
binding corresponds to a difference of $1.4$ volts in the ionization
potential. A marked difference between the electron in hydrogen and
the outermost electron in lithium lies also in the greater tendency of
the latter electron to leave the plane of the orbits. The quantity $G$
considered in §2, which gives a kind of measure for the stability for
displacements perpendicular to this plane, is thus for the outer
electron in lithium only , while for hydrogen it is . This may have
a bearing on the explanation of the apparent tendency of lithium
atoms to take a positive charge in chemical combinations with other
elements.
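For a possible negatively charged lithium atom we may expect the
configuration:

$$3(2,2): \quad a_1 = 0.362\, a_0, \quad \omega_1 = 7.64\, \omega_0; \qquad a_2 = 1.516\, a_0, \quad \omega_2 = 0.436\, \omega_0; \qquad W = 16.16\, W_0.$$

It should be remarked that we have no detailed knowledge of the
properties in the atomic state, either for lithium or hydrogen, or for
most of the elements considered below.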

N=4. Beryllium.
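For reasons analogous to those considered for helium and lithium
we may for the formation of a neutral beryllium atom assume the
following stages:

$$\begin{aligned}
4(1)&: \quad a = 0.25\, a_0, \quad \omega = 16\, \omega_0, \quad W = 16\, W_0, \\
4(2)&: \quad a = 0.267\, a_0, \quad \omega = 14.06\, \omega_0, \quad W = 28.13\, W_0, \\
4(2,1)&: \quad a_1 = 0.263\, a_0, \quad \omega_1 = 14.46\, \omega_0; \quad a_2 = 0.605\, a_0, \quad \omega_2 = 2.74\, \omega_0; \quad W = 31.65\, W_0, \\
4(2,2)&: \quad a_1 = 0.262\, a_0, \quad \omega_1 = 14.60\, \omega_0; \quad a_2 = 0.673\, a_0, \quad \omega_2 = 2.21\, \omega_0; \quad W = 33.61\, W_0,
\end{aligned}$$

although the configurations:

$$\begin{aligned}
4(3)&: \quad a = 0.292\, a_0, \quad \omega = 11.71\, \omega_0, \quad W = 35.14\, W_0, \\
4(4)&: \quad a = 0.329\, a_0, \quad \omega = 9.26\, \omega_0, \quad W = 37.04\, W_0
\end{aligned}$$

correspond to smaller values for the total energy than the configurations
$4(2,1)$ and $4(2,2)$.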
From analogy we get further for the configuration of a possible
negatively charged atom,

Comparing the outer ring of the atom considered with the ring of a
helium atom, we see that the presence of the inner ring of two
electrons in the beryllium atom markedly changes the properties of
the outer ring; partly because the outer electrons in the configuration
adopted for a neutral beryllium atom are more lightly bound than the
electrons in a helium atom, and partly because the quantity $G$, which
for helium is equal to $2$, for the outer ring in the configuration $4(2,2)$ is
only equal to $1.12$.
Since $W[4(2,3)] - W[4(2,2)] = 0.05\, W_0$, the beryllium atom will
further have a definite, although very small affinity for free electrons.
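Radii, frequencies and energies of two-ring configurations of the kind discussed in this section can be estimated numerically. The sketch below is not the paper's own calculation: it assumes that each ring acts on the electrons of the other as a uniformly charged coplanar circle, that every electron obeys $a = a_0/F$ and $\omega = F^2 \omega_0$ with $F$ the effective central charge, and that $W/W_0 = \sum n_i\, \omega_i/\omega_0$. Here $a_0$, $\omega_0$, $W_0$ denote the hydrogen values, and all function names are hypothetical.

```python
import math

def s(n):
    """Repulsion term within one ring: (1/4) * sum_{k=1}^{n-1} 1/sin(k*pi/n)."""
    return 0.25 * sum(1.0 / math.sin(k * math.pi / n) for k in range(1, n))

def ring_field(r, b, steps=720):
    """r**2 times the outward radial force on a unit charge at in-plane
    radius r from a unit charge spread uniformly over a circle of radius b
    (midpoint rule over the angle; negative when r < b)."""
    total = 0.0
    for i in range(steps):
        phi = 2.0 * math.pi * (i + 0.5) / steps
        d2 = r * r + b * b - 2.0 * r * b * math.cos(phi)
        total += (r - b * math.cos(phi)) / d2 ** 1.5
    return r * r * total / steps

def solve_rings(N, rings, iterations=100):
    """Damped fixed-point iteration for the ring radii (units of a0).
    The starting guess assumes each ring is fully screened by the rings
    inside it; it fails if that guess gives a non-positive charge."""
    radii, screened = [], 0
    for n in rings:
        radii.append(1.0 / (N - screened - s(n)))
        screened += n
    for _ in range(iterations):
        updated = []
        for i, n in enumerate(rings):
            F = N - s(n) - sum(m * ring_field(radii[i], radii[j])
                               for j, m in enumerate(rings) if j != i)
            updated.append(0.5 * radii[i] + 0.5 / F)
        radii = updated
    omegas = [1.0 / (a * a) for a in radii]            # constant angular momentum
    W = sum(n * w for n, w in zip(rings, omegas))      # in units of W0
    return radii, omegas, W

radii, omegas, W = solve_rings(3, [2, 1])
print(radii, omegas, W)
```

Calling `solve_rings(4, [2, 2])` or `solve_rings(3, [2, 2])` treats the beryllium and negatively charged lithium configurations in the same way; the damping factor of one half is a convenience to keep the iteration from oscillating.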

§4. Atoms containing greater numbers of electrons.

From the examples discussed in the former section it will appear
that the problem of the arrangement of the electrons in the atoms is
intimately connected with the question of the confluence of two rings
of electrons rotating round a nucleus outside each other, and
satisfying the condition of the universal constancy of the angular
momentum. Apart from the necessary conditions of stability for
displacements of the electrons perpendicular to the plane of the
orbits, the present theory gives very little information on this problem.
It seems, however, possible by the help of simple considerations to
throw some light on the question.
Let us consider two rings rotating round a nucleus in a single
plane, the one outside the other. Let us assume that the electrons in
the one ring act upon the electrons in the other as if the electric
charge were uniformly distributed along the circumference of the ring,
and that the rings with this approximation satisfy the condition of the
angular momentum of the electrons and of stability for displacements
perpendicular to their plane.
Now suppose that, by help of suitable imaginary extraneous
forces acting parallel to the axis of the rings, we pull the inner ring
slowly to one side. During this process, on account of the repulsion
from the inner ring, the outer will move to the opposite side of the
original plane of the rings. During the displacements of the rings the
angular momentum of the electrons round the axis of the system will
remain constant, and the diameter of the inner ring will increase while
that of the outer will diminish. At the beginning of the displacement
the magnitude of the extraneous forces to be applied to the original
inner ring will increase but thereafter decrease, and at a certain
distance between the planes of the rings the system will be in a
configuration of equilibrium. This equilibrium, however, will not be
stable. If we let the rings slowly return they will either reach their
original position, or they will arrive at a position in which the ring,
which originally was the outer, is now the inner, and vice versa.
If the charge of the electrons were uniformly distributed along the
circumference of the rings, we could by the process considered at
most obtain an interchange of the rings, but obviously not a junction
of them. Taking, however, the discrete distribution of the electrons
into account, it can be shown that, in the special case when the
numbers of electrons on the two rings are equal, and when the rings
rotate in the same direction, the rings will unite by the process,
provided that the final configuration is stable. In this case the radii
and the frequencies of the rings will be equal in the unstable
configuration of equilibrium mentioned above. In reaching this
configuration the electrons in the one ring will further be situated just
opposite the intervals between the electrons in the other, since such
an arrangement will correspond to the smallest total energy. If now
we let the rings return to their original plane, the electrons in the one
ring will pass into the intervals between the electrons in the other, and
form a single ring. Obviously the ring thus formed will satisfy the
same condition of the angular momentum of the electrons as the
original rings.
If the two rings contain unequal numbers of electrons the system
will during a process such as that considered behave very differently,
and, contrary to the former case, we cannot expect that the rings will
flow together, if by help of extraneous forces acting parallel to the axis
of the system they are displaced slowly from their original plane. It
may in this connexion be noticed that the characteristic for the
displacements considered is not the special assumption about the
extraneous forces, but only the invariance of the angular momentum
of the electrons round the centre of the rings; displacements of this
kind take in the present theory a similar position to arbitrary
displacements in the ordinary mechanics.
The above considerations may be taken as an indication that
there is a greater tendency for the confluence of two rings when each
contains the same number of electrons. Considering the successive
binding of electrons by a positive nucleus, we conclude from this that,
unless the charge on the nucleus is very great, rings of electrons will
only join together if they contain equal numbers of electrons; and that
accordingly the numbers of electrons on inner rings will only be $2$, $4$,
$8$, .... If the charge of the nucleus is very great the rings of electrons
first bound, if few in number, will be very close together, and we must
expect that the configuration will be very unstable, and that a gradual
interchange of electrons between the rings will be greatly facilitated.
This assumption in regard to the number of electrons in the rings
is strongly supported by the fact that the chemical properties of the
elements of low atomic weight vary with a period of $8$. Further, it
follows that the number of electrons on the outermost ring will always
be odd or even, according as the total number of electrons in the
atom is odd or even. This has a suggestive relation to the fact that the
valency of an element of low atomic weight always is odd or even
according as the number of the element in the periodic series is odd
or even.
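The parity statement follows directly once every inner ring holds an even number of electrons. A minimal sketch, assuming inner rings of 2, 4 or 8 electrons and using hypothetical function names, checks this for all small atoms and all combinations of up to three inner rings:

```python
from itertools import product

INNER_SIZES = (2, 4, 8)  # assumed possible numbers of electrons on inner rings

def outer_count(total, inner_rings):
    """Electrons left for the outermost ring after filling the inner rings."""
    return total - sum(inner_rings)

def parity_rule_holds(max_total=24, max_inner_rings=3):
    """True if the outermost ring is odd or even exactly as the total is."""
    for total in range(1, max_total + 1):
        for count in range(max_inner_rings + 1):
            for inner in product(INNER_SIZES, repeat=count):
                outer = outer_count(total, inner)
                if outer >= 1 and outer % 2 != total % 2:
                    return False
    return True

print(parity_rule_holds())
```

The check is deliberately exhaustive rather than clever; it only confirms that subtracting even numbers from the total never changes its parity.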
For the atoms of the elements considered in the former section we
have assumed that the two electrons first bound are arranged in a
single ring, and, further, that the two next electrons are arranged in
another ring. If $N \geq 4$ the configuration $N(4)$ will correspond to a
smaller value for the total energy than the configuration $N(2,2)$. The
greater the value of $N$ the closer will the ratio between the radii of the
rings in the configuration $N(2,2)$ approach unity, and the greater will
be the energy emitted by an eventual confluence of the rings. The
particular member of the series of the elements for which the four
innermost electrons will be arranged for the first time in a single ring
cannot be determined from the theory. From a consideration of the
chemical properties we can hardly expect that it will have taken place
before boron ($N = 5$) or carbon ($N = 6$), on account of the
observed trivalency and tetravalency respectively of these elements;
on the other hand, the periodic system of the elements strongly
suggests that already in neon ($N = 10$) an inner ring of eight
electrons will occur. Unless the configuration
corresponds to a smaller value for the total energy than the
configuration ; already for the latter configuration,
however, will be stable for displacements of the electrons
perpendicular to the plane of their orbits. A ring of $16$ electrons will
not be stable unless $N$ is very great; but in such a case the simple
considerations mentioned above do not apply.