An Empirical Analysis of California Data Breaches

Richard Chen, Stanford University
Zakir Durumeric, Stanford University
Abstract

Data breaches have steadily become more frequent over the last several years. Under California’s data breach notification law, all companies serving California residents who had their data stolen in a breach are required to disclose a breach report detailing the incident. We empirically analyze the public dataset of California data breach notifications, which contains 1,437 breach incidents between January 2012 and September 2018, to find patterns in the types of companies breached, attack vectors, and information stolen. We find that the financial services industry and large companies with over 10,000 employees are most likely to be breached. Software vulnerability is the most common descriptive attack vector. Social security numbers and payment cards are by far the two most common types of personal information stolen. We also show how attack vectors and information stolen tend to be predictable based on the company’s profile.

1 Introduction

The number of data breaches continues to grow over time. According to Risk Based Security, 2017 was the “worst year on record” for data breach activity, with over 1,200 data breaches and over 3.4 billion records exposed nationwide. [1] We know how many data breaches are reported thanks to data breach notification laws, which require companies that have been breached to report breach incidents to state governments. Yet to our knowledge, there has been no study that analyzes such a publicly available dataset.

In this study, we analyzed all data breaches reported to the California Department of Justice website. From a dataset of 1,437 data breaches between January 20, 2012 and September 21, 2018, we gathered data about the company’s profile, the attack vector, and information stolen for each data breach. Over this time period, there was an average of 18 data breaches per month with a maximum of 60 data breaches in February 2017. The number of data breaches has been steadily increasing at a rate of 0.18 more data breaches each month compared to the previous month. We found patterns in company profiles, attack vectors, and personal information stolen across data breaches.

For company profiles, we found that the eight industries most frequently affected by data breaches accounted for over half of all data breaches, and the 25 most-affected industries accounted for over 80% of all data breaches. Large businesses (1,000+ employees) accounted for over half of all data breaches.

For attack vectors, other than unauthorized access, the most common attack vectors were software vulnerability, stolen computer or data, and data found publicly. Ransomware and phishing email were two fairly recent attack vectors that have occurred commonly since 2016. When comparing attack vectors to industries, there was often a single attack vector that accounted for most data breaches in a given industry.

For personal information stolen, the two most common types were social security numbers and payment cards (credit/debit cards). This generally was true across all industries except for the industries that dealt with medical records.

2 Background

Data breach notification laws can be thought of as a laissez-faire accountability model that forces organizations to understand their security risks. Organizations can make security decisions on their own but must disclose data breaches if their decisions result in a security failure. Disclosure creates accountability inside an organization not only by raising awareness but also by defining costs for organizations to avoid in the form of notification expenses and adverse publicity. These laws haven’t eliminated data breaches but have helped mitigate their impact.

The California Security Breach and Information Act (S.B. 1386) of July 2003 established the first-ever data breach notification law. The law requires any business or state agency to notify any California resident whose personal information was acquired or reasonably believed to have been acquired
by an unauthorized person. The law only applies to information that is either (a) not encrypted or (b) encrypted if an encryption key is also compromised. [2]

“Personal information” is defined as either of the following: [3]

(A) An individual’s first name or first initial and his or her last name in combination with any one or more of the following data elements, when either the name or the data elements are not encrypted or redacted:
    (i) Social security number.
    (ii) Driver’s license number or California identification card number.
    (iii) Account number, credit or debit card number, in combination with any required security code, access code, or password that would permit access to an individual’s financial account.
    (iv) Medical information.
    (v) Health insurance information.

(B) A username or email address in combination with a password or security question and answer that would permit access to an online account.

The law also requires that a sample copy of any breach notice sent to more than 500 California residents be provided to the California Attorney General. [2] (In some cases, the organization that sent the notice is not the one that experienced the breach. For example, a bank may notify its customers of a credit card number breach that occurred at a merchant, not the bank.)

The law has had an enormous impact on providing transparency around security failures. In 2004, there were only three publicized data breaches for publicly traded companies. In 2005, when the California law went into effect, there were 51. [4] California’s law also prompted every other state to pass similar legislation in the absence of a single federal data breach notification law, with Alabama being the last state to pass a data breach notification law in March 2018. [5] Such notification to consumers and state authorities gave law enforcement, researchers, and others better data for understanding the nature and scope of the data breach problem instead of relying on reports from media outlets, which don’t cover every breach that occurs.

Finally, these laws have sparked entire new industries to help organizations prevent data breaches and respond appropriately if they occur. As an example, cyber insurance is a fairly recent industry that protects businesses from risks relating to data breaches and cyber attacks. The market for cyber insurance premiums totaled $5 billion in 2018 and is expected to double in the next five years. [6]

3 Related Work

Prior studies have mainly focused on the cost of data breaches to companies.

The 2018 Cost of a Data Breach Study: Global Overview, conducted by IBM Security and Ponemon Institute, surveyed more than 2,200 IT, data protection, and compliance professionals from 477 companies that experienced a data breach in the past 12 months. According to the report, data breaches continue to be costlier and result in more consumer records being lost or stolen cumulatively every year. The key findings were: [7]

• The average total cost of a data breach in 2018 rose from $3.62 million to $3.86 million, an increase of 6.4% from 2017.
• The average cost for each lost record in 2018 rose from $141 to $148, an increase of 4.8% from 2017.
• The average size of data breaches increased by 2.2%.

In addition to presenting trends in the cost of data breaches, the study also determined a 27.9% likelihood that an organization breached today will be breached again in the next two years. [7]

Lastly, the study reported on the relationship between how quickly an organization identifies and contains a data breach and its financial consequences. The average time to identify a breach was 197 days and the average time to contain a breach was 69 days. Companies that contained a breach in under 30 days saved over $1 million compared to those that took more than 30 days to resolve the breach. The study revealed a reduction in cost when companies participate in threat sharing activities and deploy data loss prevention technologies. [7]

Data breaches are now a consistent cost of doing business. The biggest financial consequence to organizations that experience a data breach is lost business. Industries such as healthcare and financial services have the costliest data breaches because of fines and loss of business. [7] The costs beyond settlement with banks include legal support, forensic investigation, data and network restoration, compliance with breach notification laws, business interruption, and post-breach marketing to restore reputation. [8]

4 Methodology

Since companies are forced to disclose data breaches, the California state government has a comprehensive dataset of all companies serving California residents that have been breached. In this study, we looked at all California data breach notifications that are publicly available on the California Department of Justice website. The website lists detailed breach notification reports for all data breaches that have been reported since January 20, 2012. [9]
We collected a dataset of 1,437 breach incidents that were reported between January 20, 2012 and September 21, 2018.

4.1 Company Data

We labeled each company that was breached with its “industry,” “company type,” and “company size” using LinkedIn’s dataset on companies. “Industry” is based on LinkedIn’s Industry Codes. [10] “Company type” is one of the following: educational institution, government, nonprofit, partnership, privately held, public company, self-employed, or sole proprietorship. “Company size” is one of the following: 1-10, 11-50, 51-200, 201-500, 501-1,000, 1,001-5,000, 5,001-10,000, or 10,000+.

While we were able to label every company with its industry, LinkedIn’s dataset did not have the company type or size for every company. Only 75.5% of companies were labeled with “company type” and 83.4% of companies were labeled with “company size.”

4.2 Data Breach Reports

California law requires that data breach reports follow a standardized form. The “What Happened?” section must include “a general description of the breach incident, if that information is possible to determine at the time the notice is provided,” and the “What Information Was Involved?” section must include “a list of the types of personal information that were or are reasonably believed to have been the subject of a breach.” [11]

For the “What Happened?” section, we classified each breach incident into one of the following attack vectors:

• Compromised Email: A compromised email account allows the attacker to gain access to every website that uses the email as a login.
• Compromised Machine: Physical machines (e.g. point-of-sale credit/debit card terminals, ATM machines) are hacked using methods such as card skimmers.
• Data Found Publicly: Personal information is found online by third parties or in the physical garbage bin without being shredded.
• Exposed Data: (1) Misconfigured privileges cause a database or files to be exposed publicly online and possibly searchable by Google, or enable an employee without proper authorization to access the files. (2) A software bug causes a user’s personal information to be displayed to other users.
• Insider Theft: A current or former employee exfiltrates personal information, such as by sending files to a non-work email, taking physical records or hard drives, or saving files to non-work cloud storage. Typical motives include fraud, identity theft, and theft of trade secrets.
• Lost Computer or Data: An employee loses his/her unencrypted computer, physical records of personal information are found missing, or mail containing personal information is lost in transit.
• Phishing Email: An employee is misled into entering his/her credentials into a spoofed login page.
• Ransomware: Malware encrypts a company network’s files and demands ransom for the files to be decrypted.
• Social Engineering: A spoofed email impersonates the CEO or a high-level company executive to mislead an employee into sending personal information, or an attacker misleads customer support into giving access to a user’s account.
• Software Vulnerability: Vulnerabilities include web vulnerabilities (e.g. SQL injection, XSS attack) and unpatched third-party software or libraries (e.g. Apache Struts vulnerability).
• Stolen Computer or Data: An employee’s unencrypted computer or physical records containing personal information are stolen.
• Stolen Credentials: An account’s password is the same one used on another compromised website, or the password is weak and easily brute-forced.
• Unauthorized Access: A catch-all term for vague data breach reports that follow the general form: “We detected unauthorized access to our network where some personal information may have been exposed.”
• Wrong Data Sent: An employee accidentally sends personal information or the wrong personal information to an external third party.

For the “What Information Was Involved?” section, we compiled a list of “personal information” (defined earlier in Section 2) that was affected by each breach incident. Other affected information, such as date of birth and address, could be voluntarily disclosed in the breach report but is not required by law, so we did not consider other affected information in our study due to voluntary response bias.

5 Results

5.1 Company Profiles

The companies breached most often were American Express (5.9%) and Discover Financial Services (1.8%), two major
credit card companies. This is not surprising given that both companies are required to notify their customers every time a dataset of credit card information is found publicly online (see “Data Found Publicly” in Section 4.2). 5.4% of companies were breached more than once during the time period between January 20, 2012 and September 21, 2018.

The top eight industries accounted for over 50% of all data breaches across 98 different industries: financial services (17.6%), hospital & health care (9.5%), retail (5.4%), hospitality (4.6%), higher education (4.5%), insurance (3.5%), medical practice (3.4%), and accounting or government administration (tied 4.1%). The top 25 industries accounted for over 80% of all data breaches. [Figure 1]

Most breached companies were either privately held (37.0%) or public companies (34.8%). The remaining company types were nonprofit (11.5%), educational institution (6.9%), government agency (6.0%), partnership (1.9%), sole proprietorship (1.6%), and self-employed (0.3%).

The largest share of data breaches came from large companies with 10,000+ employees (30.3%). Including the 5,001-10,000 range (5.3%) and 1,001-5,000 range (16.7%), large businesses altogether accounted for 52.3% of all data breaches. [Figure 2] This is contrary to prior claims that two-thirds of all data breaches come from small to medium-size businesses (SMBs). [12] However, there may be some response bias in the data since SMBs are less likely to report data breaches, even if required by law, in scenarios such as when an employee loses a laptop containing personal information.

Prior work found that companies that contain a data breach in under 30 days save over $1 million compared to those that take more than 30 days to resolve. [7] According to our findings, only 21.5% of data breaches were reported within 30 days. While the median report time was 78 days, the distribution of report times was heavily right-skewed, so the average report time was 175 days. The longest time it took to report a data breach was 7 years, 6 months, and 9 days (2,747 days).

18.6% of companies that reported breaches were unable to ascertain the exact date(s) when the data breach occurred. For those that were able to, there was an average of 18 data breaches per month with a maximum of 60 data breaches in February 2017. (June through September 2018 may be underreported since it takes on average 175 days to report a data breach that occurred.) The number of data breaches has been steadily increasing at a rate of 0.18 more data breaches each month compared to the previous month. There was also a slight seasonal pattern in data breaches with a small increase in the number of data breaches during February through April. [Figure 3]

Accounting was the only industry with a significant change in frequency of data breaches over time. 94.2% of all data breaches that affected accounting firms happened after January 2016. Prior to January 2016, there were only 4 reported instances of accounting firms being breached.

Figure 1: Frequency of data breaches for the top 25 industries out of 98 different industries. The top 8 industries accounted for over half of all data breaches, and the top 25 industries accounted for over 80% of all data breaches.

5.2 Attack Vectors

The most common attack vector is the generic catch-all term “unauthorized access” (27.0%) because many data breach reports did not explain the specific attack vector. For the data breach reports that did explain how the company was breached, software vulnerability (13.1%), stolen computer or data (11.4%), data found publicly (11.1%), wrong data sent (7.3%), and exposed data (7.2%) together accounted for just over half of all data breaches. [Figure 4]

Some attack vectors were concentrated within a small time frame. For compromised machine attacks, there was a spike of 39 incidents in February 2017; excluding that month, compromised machine attacks only averaged 1.35 incidents per month. This spike was the result of an attacker installing credit card skimmers on the point-of-sale payment terminals for several Acme Car Wash and Clearwater Express stations. Similarly, in October 2016, there were 13 incidents of wrong data sent, compared to the normal average of 1.41 incidents per month. This was the result of insurance company EmblemHealth inadvertently printing customers’ SSNs on the external mailing labels of packages, which happened repeatedly for multiple days throughout October before the company finally discovered the error.

Some attack vectors were fairly recent phenomena. Ransomware attacks started happening in July 2016 with hospitals and medical practices being the primary targets. Before then, there was only a single reported incident of ransomware, which affected the law firm Ziprick & Cramer, LLP in January 2015. Likewise, phishing email attacks have occurred consistently every month since February 2016, averaging 2.03 incidents per month. Before then, there were only scattered incidents of phishing email attacks, averaging just 0.16 incidents per month. [Figure 5]

There is usually a single attack vector that accounts for a large number of data breaches in each industry. Data found publicly was by far the largest cause of data breaches for financial services companies (63.8%). Others include: software vulnerability for apparel & fashion (62.5%), consumer goods (60.0%), and retail (52.5%); compromised machine for hospitality (57.9%) and restaurants (57.1%); stolen computer or data for medical practice (52.6%); and exposed data for computer software (50.0%). [Figure 6] The data also corroborated prior work that showed internal negligence was to blame for most data breaches involving personal health information. [13]

Similarly, there is usually a single industry that accounts for a large number of data breaches for each attack vector. Financial services was by far the largest industry for the data found publicly attack vector (90.3%). Others include: hospital & health care for lost computer or data (35.7%); hospitality for compromised machine (34.4%); medical practice for ransomware (33.3%); and internet for stolen credentials (32.4%). [Figure 6]
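The two percentages quoted for financial services and data found publicly (63.8% and 90.3%) come from the same incident counts divided by different totals: the first by the industry's total, the second by the attack vector's total. The following is a minimal sketch of that distinction. The incident counts are made up for illustration and are not the paper's data, and the function names are hypothetical.

```python
from collections import Counter

# Hypothetical (industry, attack_vector) labels, one pair per breach incident.
# These counts are invented for illustration; they are NOT the paper's data.
incidents = (
    [("financial services", "data found publicly")] * 9
    + [("financial services", "software vulnerability")] * 3
    + [("retail", "software vulnerability")] * 5
    + [("retail", "data found publicly")] * 1
)

pair_counts = Counter(incidents)
industry_totals = Counter(industry for industry, _ in incidents)
vector_totals = Counter(vector for _, vector in incidents)


def share_within_industry(industry, vector):
    """Fraction of an industry's breaches attributed to the given attack vector."""
    return pair_counts[(industry, vector)] / industry_totals[industry]


def share_within_vector(industry, vector):
    """Fraction of an attack vector's breaches that affected the given industry."""
    return pair_counts[(industry, vector)] / vector_totals[vector]


row_share = share_within_industry("financial services", "data found publicly")
col_share = share_within_vector("financial services", "data found publicly")
print(f"within industry: {row_share:.1%}, within vector: {col_share:.1%}")
```

Reading Figure 6 by industry corresponds to the first function, and reading it by attack vector corresponds to the second.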
Most attack vectors occurred most often in large companies with 10,000+ employees. This supports the notion that a larger attack surface increases the likelihood of an attack regardless of the specific attack vector. For instance, the more employees an organization has, the more likely it is that one of them inadvertently sends personal information to the wrong person.
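One way to make the attack-surface argument concrete is a simple probability model, which is an assumption added here and not part of the study: if each of n employees independently has the same small probability p of mis-sending personal information in a given period, the chance of at least one incident is 1 - (1 - p)^n. The per-employee probability below is hypothetical.

```python
def prob_at_least_one_incident(n_employees: int, p_per_employee: float) -> float:
    """Probability that at least one of n independent employees causes an incident.

    Assumes independence and an identical per-employee probability, which is a
    simplification for illustration only.
    """
    if n_employees < 0:
        raise ValueError("n_employees must be non-negative")
    if not 0.0 <= p_per_employee <= 1.0:
        raise ValueError("p_per_employee must be between 0 and 1")
    return 1.0 - (1.0 - p_per_employee) ** n_employees


# Hypothetical per-employee probability, chosen only for illustration.
P_HYPOTHETICAL = 0.0001

for size in (10, 1_000, 10_000):
    likelihood = prob_at_least_one_incident(size, P_HYPOTHETICAL)
    print(f"{size:>6} employees: {likelihood:.4f}")
```

Under these assumptions the likelihood rises steeply with headcount even though the per-employee risk is unchanged.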
The main exceptions were software vulnerability and ransomware. 24.7% of software vulnerability attacks affected businesses with 51-200 employees. 40.0% of ransomware attacks affected small businesses with 1-10 employees; these were doctors’ offices that relied on a handful of insecure software systems to store their medical records. This data runs contrary to the narrative that large organizations are the primary targets of ransomware attacks, such as when the NotPetya ransomware crippled the network of the multinational shipping giant Maersk and cost $250-$300 million in damages. [14]

Figure 2: Frequency of data breaches per company size. Large businesses (1,000+ employees) accounted for over half of all data breaches.
5.3 Personal Information Stolen

The two most common types of personal information stolen by far were social security numbers (42.4%) and payment cards including credit/debit cards (41.1%). Other information stolen included medical records (14.8%), passwords of other users in addition to the compromised account (11.6%), bank routing and account numbers (10.0%), health insurance information (8.7%), and driver’s license numbers (7.9%). [Figure 7]

Figure 3: Number of data breaches per month since January 2012, broken down by each attack vector. There was an average of 18 data breaches per month with a maximum of 60 data breaches in February 2017. The number of data breaches has been steadily increasing at a rate of 0.18 more data breaches each month compared to the previous month.

Figure 5: Frequency of ransomware and phishing email over time. Both have become much more common recently, with ransomware starting in July 2016 and phishing email starting in February 2016.
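The trend quoted in the Figure 3 caption (0.18 more data breaches each month) reads like the slope of a straight line fitted to monthly counts. The source does not say how the figure was estimated, so the following is only a sketch of one plausible approach, an ordinary least-squares slope against the month index. The monthly counts are hypothetical and the function name is illustrative.

```python
def ols_slope(counts):
    """Least-squares slope of counts against the month index 0, 1, 2, ..."""
    n = len(counts)
    if n < 2:
        raise ValueError("need at least two months of data")
    mean_x = (n - 1) / 2
    mean_y = sum(counts) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(counts))
    variance = sum((x - mean_x) ** 2 for x in range(n))
    return covariance / variance


# Hypothetical monthly breach counts; NOT the paper's data.
monthly_counts = [10, 12, 11, 13, 15, 14, 16]

slope = ols_slope(monthly_counts)
print(f"estimated change per month: {slope:.3f}")
```

A slope of 0.18 under this reading would mean that the fitted line gains roughly one additional breach per month every five to six months.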
There were three spikes in theft of payment card information in April 2015, February 2017, and March 2017. 14% of all payment card thefts happened within those three months. These concentrated data breaches occurred because many companies within the same industry were using the same payment card processor (whether a point-of-sale payment terminal or software that stores payment card information) that got compromised.

In April 2015, 23 wineries were using the Missing Link direct sales software system to store payment card information, which was accessed by an unauthorized third party. In February 2017, several Acme Car Wash and Clearwater Express stations were using the same point-of-sale payment terminals that were compromised with card skimmers. In March 2017, 24 hotels were using the Sabre SynXis Central Reservations system to facilitate the booking of hotel reservations, in which stolen credentials enabled an attacker to steal payment card information. These incidents show that relying on a single vendor to process personal information, such as payment cards, creates a single point of failure.

Figure 4: Frequency of data breaches per attack vector. The most common attack vector is the generic catch-all term “unauthorized access” because many data breach reports did not explain the specific attack vector.

Figure 6: Frequency of each attack vector for the top 25 industries. The scattered green squares suggest that there is a single attack vector that accounts for most data breaches for each industry and likewise a single industry that accounts for most data breaches for each attack vector. (Unauthorized access is omitted since it is a non-descriptive attack vector.)

Figure 8: Frequency of each information stolen for the top 25 industries. In general, social security numbers and payment cards were the two most common types of personal information stolen regardless of the industry, except for the industries that dealt with medical records.
Bank account numbers have also been stolen more frequently in recent years. There were three times as many reported incidents after November 2015 as before November 2015.
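The two periods being compared are not the same length within the dataset window (January 20, 2012 to September 21, 2018), so a per-month rate is a fairer comparison than raw counts. The sketch below shows that normalization. The incident counts are hypothetical, chosen only to match the 3:1 ratio in the text, and treating November 1, 2015 as the split date is an assumption.

```python
from datetime import date


def months_between(start: date, end: date) -> int:
    """Whole calendar months from start to end, ignoring the day of the month."""
    return (end.year - start.year) * 12 + (end.month - start.month)


def monthly_rate(incident_count: int, start: date, end: date) -> float:
    """Average number of incidents per calendar month in the given period."""
    months = months_between(start, end)
    if months <= 0:
        raise ValueError("end must be at least one calendar month after start")
    return incident_count / months


# Dataset window from the paper; the exact split day is an assumption.
window_start = date(2012, 1, 20)
split = date(2015, 11, 1)
window_end = date(2018, 9, 21)

# Hypothetical counts with the 3:1 ratio described in the text.
before_count, after_count = 30, 90

rate_before = monthly_rate(before_count, window_start, split)
rate_after = monthly_rate(after_count, split, window_end)
rate_ratio = rate_after / rate_before
print(f"per-month rate ratio: {rate_ratio:.2f}")
```

Because the later period is shorter, the per-month ratio under these assumptions comes out larger than the raw 3:1 count ratio.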
Social security numbers and payment cards were the two most common types of personal information stolen across the top 25 industries. The few exceptions were medical records for hospital & health care (46.5%) and medical practice (36.8%), as well as passwords for Internet companies (54.3%). The financial services industry was the industry most commonly affected by stolen social security numbers (19.5%), payment cards (31.8%), bank account numbers (36.8%), and driver’s license numbers (27.1%). The hospital & health care industry was the industry most commonly affected by stolen medical records (51.9%) and health insurance information (30.8%). The Internet industry was the industry most commonly affected by stolen passwords (21.2%). [Figure 8]
Social security numbers and payment cards were also the two most common types of personal information stolen across all company sizes with the exception of health insurance information, which was most common for small businesses with 1-10 employees.

Figure 7: Frequency of each type of personal information stolen. The two most common types of personal information stolen by far were social security numbers and payment cards.
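The percentages reported at the start of Section 5.3 add up to more than 100% (the seven listed types alone total 136.5%), which is consistent with a single breach report being able to list several types of personal information. A minimal sketch of that kind of multi-label tally follows. The reports are hypothetical and the function name is illustrative, not the authors' code.

```python
from collections import Counter

# Hypothetical breach reports; each lists every information type involved.
# These are invented for illustration and are NOT the paper's data.
reports = [
    {"social security number", "driver's license number"},
    {"payment card"},
    {"social security number", "medical records", "health insurance"},
    {"payment card", "password"},
]


def frequency_by_type(breach_reports):
    """Share of reports that mention each information type (multi-label)."""
    counts = Counter(info for report in breach_reports for info in report)
    total = len(breach_reports)
    return {info: count / total for info, count in counts.items()}


freq = frequency_by_type(reports)
for info, share in sorted(freq.items(), key=lambda item: (-item[1], item[0])):
    print(f"{info}: {share:.0%}")
print("sum of shares:", sum(freq.values()))
```

Each share is computed against the number of reports, not the number of labels, so the shares are not expected to sum to 1.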
6 Conclusion and Future Work

The attack vectors and information stolen in data breaches tend to follow a predictable pattern depending on the company’s profile. For instance, we found that for many industries, the types of attack vectors and information stolen are concentrated in only a few categories. Based on our findings, we can better predict how a company is going to be breached and what information is at risk of getting stolen. This is very useful for not only high-risk organizations but also cyber insurance underwriters that have to create cyber risk models to determine premiums based on the company’s profile.

There are many possible areas for future work. This study only focused on California data breaches, but we could extend this study to compare California data breaches to those from other states, since all states have a data breach notification law. There may be notable differences because many tech companies are located in California.

Furthermore, we could assess the financial damage of data breaches. Some data breaches do not materially affect the company’s bottom line, while others, such as the Equifax data breach, greatly impact the company’s financials. With this information, we can ascertain what types of data breaches cause more financial damage than others, and whether the severity of financial damage correlates with the company’s profile.

Lastly, for data breaches in which the responsible party could be identified, we could investigate whether and how the stolen data was used maliciously.

7 References

[1] “Data Breach Activity Reaches All-Time High.” Help Net Security, Help Net Security, 23 May 2017.
[2] “Data Breach Charts.” BakerHostetler, July 2018.
[3] California Civil Code. Title 1.81, Section 1798.81.5.
[4] Singer, P.W., and Allan Friedman. Cybersecurity and Cyberwar: What Everyone Needs to Know. Oxford University Press, 2014.
[5] Heck, Zachary. “Alabama Rolls with Tide as Last State to Adopt Breach Notification Law.” Lexology, Lexology, 30 Apr. 2018.
[6] Betterley, Richard S. “Cyber/Privacy Insurance Market Survey—2018.” The Betterley Report, June 2018.
[7] “2018 Cost of a Data Breach Study: Global Overview.” Ponemon Institute, July 2018.
[8] Williams, Gareth, Robert E. Schulz, David C. Tesher, and Laurence P. Hazell. “Cyber Risk and Corporate Credit.” RatingsDirect. 9 June 2015.
[9] “Search Data Security Breaches.” State of California Department of Justice, DOJ, oag.ca.gov/privacy/databreach/list.
[10] “Industry Codes.” LinkedIn Developers, LinkedIn Corporation.
[11] California Civil Code. Title 1.81, Section 1798.82.
[12] Boneh, Dan. “Why Is Computer Security Difficult?” Cybersecurity: A Legal and Technical Perspective. 3 Apr. 2018, Stanford, California.
[13] “Internal Negligence to Blame for Most Data Breaches Involving Personal Health Information.” Help Net Security, Help Net Security, 25 Nov. 2018.
[14] Greenberg, Andy. “The Untold Story of NotPetya, the Most Devastating Cyberattack in History.” Wired, Conde Nast, 24 Oct. 2018.
