An Empirical Analysis of California Data Breaches
We collected a dataset of 1,437 breach incidents that were reported between January 20, 2012 and September 21, 2018.

4.1 Company Data

We labeled each breached company with its “industry,” “company type,” and “company size” using LinkedIn’s dataset on companies. “Industry” is based on LinkedIn’s Industry Codes. [10] “Company type” is one of the following: educational institution, government, nonprofit, partnership, privately held, public company, self-employed, or sole proprietorship. “Company size” is one of the following: 1-10, 11-50, 51-200, 201-500, 501-1,000, 1,001-5,000, 5,001-10,000, or 10,000+.

While we were able to label every company with its industry, LinkedIn’s dataset did not have the company type or size for every company. Only 75.5% of companies were labeled with “company type” and 83.4% of companies were labeled with “company size.”

• Compromised Machine: Physical machines (e.g. point-of-sale credit/debit card terminals, ATM machines) are hacked using methods such as card skimmers.

• Data Found Publicly: Personal information is found online by third parties or in the physical garbage bin without being shredded.

• Exposed Data: (1) Misconfigured privileges cause a database or files to be exposed publicly online (and possibly searchable by Google) or enable an employee without proper authorization to access the files. (2) A software bug causes a user’s personal information to be displayed to other users.

• Insider Theft: A current or former employee exfiltrates personal information, such as by sending files to a non-work email, taking physical records or hard drives, or saving files to a non-work cloud storage. Some purposes are for committing fraud, identity theft, or theft of trade secrets.

• Lost Computer or Data: An employee loses his/her unencrypted computer, physical records of personal information are found missing, or mail containing personal information is lost in transit.

• Phishing Email: An employee is misled into entering his/her credentials into a spoofed login page.

• Ransomware: Malware encrypts a company network’s files and demands ransom for the files to be decrypted.

• Social Engineering: A spoofed email impersonates the CEO or a high-level company executive to mislead an employee into sending personal information, or an attacker misleads customer support into giving access to a user’s account.

• Wrong Data Sent: An employee accidentally sends personal information or the wrong personal information to an external third party.

For the “What Information Was Involved?” section, we compiled a list of “personal information” (defined earlier in Section 2) that was affected by each breach incident. Other affected information, such as date of birth and address, could be voluntarily disclosed in the breach report but is not required by law, so we did not consider other affected information in our study due to voluntary response bias.

5 Results

5.1 Company Profiles

The companies breached most often were American Express (5.9%) and Discover Financial Services (1.8%), two major
credit card companies. This is not surprising given that both companies are required to notify their customers every time a dataset of credit card information is found publicly online (see “Data Found Publicly” in Section 4.2). 5.4% of companies were breached more than once between January 20, 2012 and September 21, 2018.

The top eight industries accounted for over 50% of all data breaches across 98 different industries: financial services (17.6%), hospital & health care (9.5%), retail (5.4%), hospitality (4.6%), higher education (4.5%), insurance (3.5%), medical practice (3.4%), and accounting or government administration (tied 4.1%). The top 25 industries accounted for over 80% of all data breaches. [Figure 1]

Most breached companies were either privately held (37.0%) or public companies (34.8%). The remaining company types were nonprofit (11.5%), educational institution (6.9%), government agency (6.0%), partnership (1.9%), sole proprietorship (1.6%), and self-employed (0.3%).

The largest share of data breaches came from large companies with 10,000+ employees (30.3%). Including the 5,001-10,000 range (5.3%) and 1,001-5,000 range (16.7%), large businesses altogether accounted for 52.3% of all data breaches. [Figure 2] This is contrary to prior claims that two-thirds of all data breaches come from small to medium-size businesses (SMBs). [12] However, there may be some response bias in the data since SMBs are less likely to report data breaches, even if required by law, in scenarios such as when an employee loses a laptop containing personal information.

Prior work found that companies that contain a data breach in under 30 days save over $1 million compared to those that take more than 30 days to resolve. [7] According to our findings, only 21.5% of data breaches were reported within 30 days. While the median report time was 78 days, the distribution of report times was heavily skewed right such that the average report time was 175 days. The longest time it took to report a data breach was 7 years, 6 months, and 9 days (2,747 days).

18.6% of companies that reported breaches were unable to ascertain the exact date(s) when the data breach occurred. For those that were able to, there was an average of 18 data breaches per month with a maximum of 60 data breaches in February 2017. (June through September 2018 may be underreported since it takes on average 175 days to report a data breach that occurred.) The number of data breaches has been steadily increasing at a rate of 0.18 more data breaches each month compared to the previous month. There was also a slight seasonal pattern in data breaches with a small increase in the number of data breaches during February through April. [Figure 3]

Accounting was the only industry with a significant change in frequency of data breaches over time. 94.2% of all data breaches that affected accounting firms happened after January 2016. Prior to January 2016, there were only 4 reported instances of accounting firms being breached.

5.2 Attack Vectors

The most common attack vector is the generic catch-all term “unauthorized access” (27.0%) because many data breach reports did not explain the specific attack vector. For the data breach reports that did explain how the company was breached, software vulnerability (13.1%), stolen computer or data (11.4%), data found publicly (11.1%), wrong data sent (7.3%), and exposed data (7.2%) accounted for over half of all attack vectors. [Figure 4]

Some attack vectors were concentrated within a small time frame. For compromised machine attacks, there was a spike of 39 incidents in February 2017; excluding that month, compromised machine attacks only averaged 1.35 incidents per month. This spike was the result of an attacker installing credit card skimmers on the point-of-sale payment terminals for several Acme Car Wash and Clearwater Express stations. Similarly, in October 2016, there were 13 incidents of wrong data sent, compared to the normal average of 1.41 incidents per month. This was the result of insurance company EmblemHealth inadvertently printing customers’ SSNs on the external mailing labels of packages, which happened repeatedly for multiple days throughout October before the company finally discovered the error.

Some attack vectors were fairly recent phenomena. Ransomware attacks started happening in July 2016 with hospitals and medical practices being the primary targets. Before then, there was only a single reported incident of ransomware, which affected the law firm Ziprick & Cramer, LLP in January 2015. Likewise, phishing email attacks have occurred consistently every month since February 2016, averaging 2.03 incidents per month. Before then, there were only scattered incidents of phishing email attacks, averaging just 0.16 incidents per month. [Figure 5]

There is usually a single attack vector that accounts for a large number of data breaches in each industry. Data found publicly was by far the largest cause of data breaches for financial services companies (63.8%). Others include: software vulnerability for apparel & fashion (62.5%), consumer goods (60.0%), and retail (52.5%); compromised machine for hospitality (57.9%) and restaurants (57.1%); stolen computer or data for medical practice (52.6%); and exposed data for computer software (50.0%). [Figure 6] The data also corroborated prior work that showed internal negligence was to blame for most data breaches involving personal health information. [13]

Similarly, there is usually a single industry that accounts for a large number of data breaches for each attack vector. Financial services was by far the largest industry for the data found publicly attack vector (90.3%). Others include: hospital & health care for lost computer or data (35.7%); hos-
Figure 1: Frequency of data breaches for the top 25 industries out of 98 different industries. The top 8 industries accounted for
over half of all data breaches, and the top 25 industries accounted for over 80% of all data breaches.
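
The cumulative shares quoted in this caption (top 8 and top 25 industries) can be computed from per-incident industry labels. The following is a minimal sketch of one way to do that, not the study's actual code: the function name `cumulative_share` and the toy `incidents` list are hypothetical and are not taken from the dataset.

```python
from collections import Counter


def cumulative_share(labels, k):
    """Return the fraction of all records covered by the k most common labels."""
    counts = Counter(labels)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no records to summarize")
    # most_common(k) returns the k labels with the highest counts.
    top_k = sum(count for _, count in counts.most_common(k))
    return top_k / total


# Hypothetical toy data, NOT the study's dataset: one industry label per incident.
incidents = (
    ["financial services"] * 5
    + ["retail"] * 3
    + ["hospitality"] * 1
    + ["insurance"] * 1
)

top2 = cumulative_share(incidents, 2)
print(f"Top 2 industries cover {top2:.0%} of incidents")
```

Counting one label per incident means a company breached several times contributes several records, which matches a per-breach (rather than per-company) reading of the percentages.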
Figure 3: Number of data breaches per month since January 2012, broken down by each attack vector. There was an average
of 18 data breaches per month with a maximum of 60 data breaches in February 2017. The number of data breaches has been
steadily increasing at a rate of 0.18 more data breaches each month compared to the previous month.
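
A per-month growth rate such as the one quoted in this caption is commonly estimated as the slope of a least-squares line fitted to the monthly counts. The paper does not state how its figure was obtained, so the sketch below is only an illustration under that assumption; `monthly_trend` and the `example` series are hypothetical.

```python
def monthly_trend(counts):
    """Ordinary least-squares fit of counts against month index 0..n-1.

    Returns (slope, intercept); the slope is the estimated change in
    breaches per month.
    """
    n = len(counts)
    if n < 2:
        raise ValueError("need at least two months of data")
    mean_x = (n - 1) / 2
    mean_y = sum(counts) / n
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(counts))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical monthly counts, NOT the study's data.
example = [10, 12, 11, 13, 14, 13, 15]
slope, intercept = monthly_trend(example)
print(f"Estimated trend: {slope:+.2f} breaches per month")
```

Because recent months may be underreported, a fit like this would likely be more reliable if the trailing months were excluded before estimating the slope.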
Figure 8: Frequency of each type of information stolen for the top 25 industries. In general, social security numbers and payment cards were the two most common types of personal information stolen regardless of the industry, except for the industries that dealt with medical records.
6 Conclusion and Future Work

The attack vectors and information stolen in data breaches tend to follow a predictable pattern depending on the company’s profile. For instance, we found that for many industries, the types of attack vectors and information stolen are concentrated in only a few categories. Based on our findings, we can better predict how a company is going to be breached and what information is at risk of getting stolen. This is useful not only for high-risk organizations but also for cyber insurance underwriters that have to create cyber risk models to determine premiums based on the company’s profile.

There are many possible areas for future work. This study only focused on California data breaches, but we could extend this study to compare California data breaches to those from other states, since all states have a data breach notification law. There may be notable differences because many tech companies are located in California.

Furthermore, we could assess the financial damage of data breaches. Some data breaches do not materially affect the company’s bottom line, while others, such as the Equifax data breach, greatly impact the company’s financials. With this information, we can ascertain what types of data breaches cause more financial damage than others, and whether the severity of financial damage correlates with the company’s profile.

Lastly, for data breaches where the responsible party was identified, we could figure out if and/or how such data was used maliciously.

7 References

[1] “Data Breach Activity Reaches All-Time High.” Help Net Security, Help Net Security, 23 May 2017.

[9] “Search Data Security Breaches.” State of California Department of Justice, DOJ, oag.ca.gov/privacy/databreach/list.

[10] “Industry Codes.” LinkedIn Developers, LinkedIn Corporation.

[11] California Civil Code. Title 1.81, Section 1798.82.

[12] Boneh, Dan. “Why Is Computer Security Difficult?” Cybersecurity: A Legal and Technical Perspective. 3 Apr. 2018, Stanford, California.

[13] “Internal Negligence to Blame for Most Data Breaches Involving Personal Health Information.” Help Net Security, Help Net Security, 25 Nov. 2018.

[14] Greenberg, Andy. “The Untold Story of NotPetya, the Most Devastating Cyberattack in History.” Wired, Conde Nast, 24 Oct. 2018.