Evaluation of Using Big Data For Credit Ratings


Models Using Big Data Approaches to Improve Credit Ratings / Scorings in Retail Business

- Term paper at the HSBA Hamburg School of Business Administration during the
study course Master of Science in Global Management & Governance

Student:

Tim Decker

Registration number:

0616

Student's date of birth:

July 28, 1988

Module:

Topical Research

Lecturer:

Prof. Dr. Christoph Bauer

Date of submission:

June 13, 2016

Abstract
This paper aims to answer the question of how the use of big data can improve
consumer credit scores. First, the term big data is defined and distinguished from
other forms of data. Second, the methodology of credit scores is explained and the
data sources that are used for analysis are explored. Then the benefits and
challenges of utilizing big data for credit scores are listed and assessed. To
show how these work out in practice, the business model of the fintech startup
Kreditech is explained and analyzed. The paper finishes with recommendations on
when to use big data in credit scores and how to deal with the challenges.

Table of contents
1. Introduction
2. Explanation of underlying concepts
2.1. Big data
2.2. Credit scores
3. Big data in credit scores
3.1. Benefits
3.2. Challenges
4. Example business: Kreditech
5. Conclusion
6. Limits of the Paper
List of sources

List of Figures
Figure 1: Influence of information on loan default rates

1. Introduction
Lenders face the dilemma of either giving out too few loans (and losing potential
revenues) or giving out too many (and losing the principal amount to defaults). To aid
them in the decision whether or not to give an applicant a loan, many lenders use
scoring or rating models. These are statistical tools that use data to evaluate the
creditworthiness of the applicant. When lending to companies, the lender usually has
access to extensive financial information about the borrower. For a long time, lenders
who focused on consumers had more difficulty gaining sufficient insight into their
applicants' ability to pay back loans. Today most consumers create vast pools of
data and profiles online, which are used by some fintech startups like Kreditech,
Vantage Score or Big Data Scoring to make their credit scores more precise.
While this might pose a great opportunity for lenders, many consumers regard it with
skepticism and fear an intrusion into their privacy. Since trust is an important asset in
the financial sector, it is important to have a clear strategy for dealing with this,
regardless of whether there are real security problems or only a subjective
discomfort.

2. Explanation of underlying concepts


2.1. Big data

Businesses and people create enormous trails of data during their online activity.
Big data doesn't necessarily have to stem from online activity, but the sheer number
of people and devices that are networked today generates a major part of the available
data. Data has always been important for businesses and in other areas where decisions
have to be made, but because of technological advances more data than ever
before is available. The main advancement behind this is computing power,
which is necessary to store and analyze data. Other major facilitators are mobile
devices and autonomous sensors, which ease the creation and communication of data
and the access to it.1

1 McKinsey & Company, Inc. 2011, 1-3

Big data is often described by three terms: volume, velocity and variety. Volume
describes the huge amount of data that is available. Velocity refers to the high
speed with which new data is created, transmitted and analyzed. Variety illustrates
that there are different sources and types of data. The data types are clustered into
structured (clearly defined content, often stored in databases), semi-structured
(automated communication between devices) and unstructured data (texts, videos,
pictures, etc.).2
People mostly create unstructured data, which makes up the vast majority of
available data. Even though a lot of unstructured data is available, it is difficult to
analyze it automatically and draw conclusions from it. To tackle this problem,
statistical techniques like machine learning, pattern recognition and predictive
modeling are utilized. This can improve decision making by incorporating more
information than any single person could handle or comprehend.3

2.2. Credit scores

Credit scores are models to evaluate the likelihood of loan applicants paying back
their loans. They are used by banks and other lenders to determine whether to grant
loans and to price them in accordance with the risk. This is done by comparing
the applicant to other borrowers who borrowed money in the past. If the new
applicant is similar to those who paid back their loans, he will get a better score. If
he is similar to those who didn't pay back, his score will be worse.4
To accomplish this, the event that is to be predicted first has to be clearly defined,
in this case "pay back the loan". This can range from "paid each installment on
time" to "paid back the agreed amount, but only after renegotiation and extension
of the duration". A common definition among lenders is "was never more than 90
days late in payments". The aforementioned comparison of applicants to previous
borrowers is done by comparing multiple attributes of each. This requires access to
historical data of previous borrowers (especially at the point in time when they
applied for the loan) and information on whether they have paid back their loans or
not. To warrant a valid statistical analysis, the transactions have to be fundamentally
similar (there is a difference between financing a TV for 1,000 € and a house for
500,000 €), the available data has to be correct and a high number of past
transactions is needed. The quality of the information is also very important. If the
analysis is based on wrong information or if the information regarding individual
transactions is incomplete, it may lead to wrong results.
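
A minimal sketch of how the predicted event could be encoded, assuming the "never more than 90 days late" definition mentioned above. The function names, the data layout (pairs of due date and payment date) and the sample loans are hypothetical and only serve as an illustration:

```python
from datetime import date

THRESHOLD_DAYS = 90  # the "never more than 90 days late" rule from the text


def days_late(due, paid, observed):
    """Days an installment was overdue.

    Unpaid installments (paid is None) are measured up to the observation date.
    """
    settled = paid if paid is not None else observed
    return max(0, (settled - due).days)


def is_good_loan(installments, observed, threshold=THRESHOLD_DAYS):
    """True if no installment was ever more than `threshold` days late."""
    return all(
        days_late(due, paid, observed) <= threshold for due, paid in installments
    )


# Hypothetical repayment histories, observed on 1 June 2016.
observed = date(2016, 6, 1)
punctual = [
    (date(2016, 1, 1), date(2016, 1, 1)),
    (date(2016, 2, 1), date(2016, 2, 10)),   # 9 days late, still "good"
]
delinquent = [
    (date(2016, 1, 1), date(2016, 1, 5)),
    (date(2016, 2, 1), None),                # still unpaid after 121 days
]

print(is_good_loan(punctual, observed))    # True
print(is_good_loan(delinquent, observed))  # False
```

A stricter or looser definition of the event only changes the threshold, but it changes which past borrowers count as good or bad and therefore the model that is built on them.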

2 BITKOM 2015a, 13-14
3 Ibid, 13-31
4 Schröder, Taeger 2014, 26-29

The most common statistical approaches to credit scores are multivariate regression
models. They are able to connect multiple independent variables and their relations
to each other to one dependent variable (here the creditworthiness). The most
common one among them is the classification and regression tree (CART). These trees
are especially helpful when dealing with large amounts of data, because they filter
attributes (independent variables) and use those with the most information for the
model first. To avoid overfitting the model, which can occur if there are too many
attributes compared to the number of observations, the Gini measure is used.
Otherwise the model would become too sensitive and also react to outliers and noise in
the training data. Another issue with credit ratings is the highly skewed proportion of
good and bad outcomes (depending on the region and industry, credit defaults make up
only 2-10% of the observations). To counter this problem, a boosting
technique can be used which applies stronger weights to the scarcer observations.
Machine learning is often employed to build the models since it is able to recognize
patterns in large data sets.5
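
To illustrate the two ideas in this paragraph, the following sketch computes the Gini measure for a candidate split and derives observation weights that compensate for the scarcity of defaults. It uses plain class re-weighting rather than a full boosting algorithm, and all names and numbers are illustrative assumptions, not taken from the cited study:

```python
def gini_impurity(labels):
    """Gini impurity of 0/1 labels (1 = default): 1 - p_bad^2 - p_good^2."""
    if not labels:
        return 0.0
    share_bad = sum(labels) / len(labels)
    return 1.0 - share_bad ** 2 - (1.0 - share_bad) ** 2


def split_gain(parent, left, right):
    """Reduction in Gini impurity achieved by splitting parent into left/right."""
    n = len(parent)
    weighted = (
        len(left) * gini_impurity(left) + len(right) * gini_impurity(right)
    ) / n
    return gini_impurity(parent) - weighted


def balanced_weights(labels):
    """Weight each observation inversely to the frequency of its class."""
    n_bad = sum(labels)
    n_good = len(labels) - n_bad
    if n_bad == 0 or n_good == 0:
        raise ValueError("both classes must be present")
    n = len(labels)
    return [n / (2 * n_bad) if y == 1 else n / (2 * n_good) for y in labels]


# Hypothetical portfolio: 5 defaults among 100 past loans.
left = [1] * 4 + [0] * 6     # e.g. applicants sharing some risky attribute
right = [1] * 1 + [0] * 89   # all other applicants
parent = left + right

print(round(gini_impurity(parent), 4))            # 0.095
print(round(split_gain(parent, left, right), 4))  # 0.0272
weights = balanced_weights(parent)
print(weights[0], round(weights[-1], 3))          # 10.0 0.526
```

An attribute whose split yields a larger gain separates good and bad borrowers better and would be used earlier in the tree. With the weights applied, the 5 defaults carry the same total weight as the 95 repaid loans.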
As a result, the CART provides a numerical score for each credit applicant. To
convert that score into a credit decision, a threshold (the minimum score needed to
be given a loan) has to be used. Setting the level of this threshold involves a trade-off
between more false positives (applicants who will not pay back the loan but are
accepted) and more false negatives (applicants who would have paid back the loan but
are declined). This should be done by a cost/benefit analysis. False positives lead to
high costs because the principal loan amount is lost. False negatives have a less
significant direct impact, because the lender only loses the revenues he would have
made with that loan. However, when considering that a customer might also use
other products of a financial services provider and might stop doing so after being
declined for a loan, the loss in revenues increases.6
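
The cost/benefit analysis described above can be sketched as a search over candidate thresholds. The scores, outcomes and cost figures below are invented for illustration, and the function names are hypothetical:

```python
def total_cost(scores, defaulted, threshold, loss_per_default, margin_per_good_loan):
    """Cost of a threshold: lost principal on accepted defaulters plus
    forgone margin on declined applicants who would have repaid."""
    cost = 0.0
    for score, bad in zip(scores, defaulted):
        accepted = score >= threshold
        if accepted and bad:            # false positive in the paper's terms
            cost += loss_per_default
        elif not accepted and not bad:  # false negative in the paper's terms
            cost += margin_per_good_loan
    return cost


def best_threshold(scores, defaulted, loss_per_default, margin_per_good_loan):
    """Try every observed score (plus one value above the maximum, which
    declines everyone) and return the cheapest threshold."""
    candidates = sorted(set(scores)) + [max(scores) + 1]
    return min(
        candidates,
        key=lambda t: total_cost(
            scores, defaulted, t, loss_per_default, margin_per_good_loan
        ),
    )


# Hypothetical past applicants: score and whether they later defaulted.
scores = [300, 420, 450, 510, 560, 600, 640, 700]
defaulted = [True, True, False, True, False, False, False, False]

# Expensive defaults push the threshold up ...
print(best_threshold(scores, defaulted, loss_per_default=1000, margin_per_good_loan=100))  # 560
# ... while expensive lost customers push it down.
print(best_threshold(scores, defaulted, loss_per_default=100, margin_per_good_loan=1000))  # 450
```

The indirect loss from customers who leave after being declined could be reflected by raising `margin_per_good_loan`, which moves the chosen threshold towards accepting more applicants.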
An example of the benefit of using scoring models: there are 100 loan applicants
and 10 of them will default on their payments. If a lender gave out loans to
random applicants, statistically 10% of them would default. If the lender had
perfect information, he could give loans to the 90 people who will pay back, without
any defaults. A scoring algorithm tries to shift the real outcome from the random
distribution towards the perfect selection.
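
The example can be turned into three small functions that give the number of defaults for any number of loans granted. The ranking used for the scoring case is a made-up assumption; only the figures of 100 applicants and 10 defaulters come from the text:

```python
def defaults_random(k, n_applicants=100, n_defaulters=10):
    """Expected defaults when k loans go to randomly chosen applicants."""
    return k * n_defaulters / n_applicants


def defaults_perfect(k, n_applicants=100, n_defaulters=10):
    """Defaults with perfect information: defaulters are served last."""
    return max(0, k - (n_applicants - n_defaulters))


def defaults_with_scoring(k, ranked_outcomes):
    """Defaults among the k best-scored applicants.

    ranked_outcomes lists True for a defaulter, ordered from best to worst score.
    """
    return sum(ranked_outcomes[:k])


# Hypothetical scoring result: defaulters mostly, but not only, at the bottom.
defaulter_ranks = {40, 60, 75, 85, 90, 93, 95, 97, 98, 99}
ranked = [rank in defaulter_ranks for rank in range(100)]

for k in (50, 90, 100):
    print(k, defaults_random(k), defaults_with_scoring(k, ranked), defaults_perfect(k))
```

With these assumed numbers, granting 50 loans leads to 5 expected defaults under random selection, 1 with the assumed scoring order and 0 with perfect information. When all 100 loans are granted, all three cases end at 10 defaults.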

5 Khandani, Kim and Lo 2010, 2773-2777
6 Ibid, 2780-2783

Figure 1: Influence of information on loan default rates7
[Line chart. x-axis: number of loans given (0 to 100); y-axis: number of defaults
(0 to 10); curves: random distribution, with scoring, perfect information.]

The performance of a scoring algorithm is measured by the distance between the
results with scoring and the results of a random distribution. This is also called
area under the curve (AUC).8 It reflects the costs saved by reducing the
number of false positives and the revenues gained by reducing the
number of false negatives.
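
As a hedged illustration: a common formal definition of AUC is the area under the ROC curve, which equals the probability that a randomly chosen repaying borrower receives a higher score than a randomly chosen defaulter. The sketch below computes it by comparing all such pairs. It assumes this standard definition rather than the exact measure used in the cited source, and the sample data is invented:

```python
def roc_auc(scores, defaulted):
    """Share of (repayer, defaulter) pairs in which the repayer scores higher.

    Ties count as half a win. 0.5 corresponds to random scoring,
    1.0 to a perfect separation of repayers and defaulters.
    """
    good = [s for s, bad in zip(scores, defaulted) if not bad]
    bad_scores = [s for s, bad in zip(scores, defaulted) if bad]
    if not good or not bad_scores:
        raise ValueError("need at least one repayer and one defaulter")
    wins = 0.0
    for g in good:
        for b in bad_scores:
            if g > b:
                wins += 1.0
            elif g == b:
                wins += 0.5
    return wins / (len(good) * len(bad_scores))


# Hypothetical past applicants: score and whether they later defaulted.
scores = [300, 420, 450, 510, 560, 600, 640, 700]
defaulted = [True, True, False, True, False, False, False, False]

print(round(roc_auc(scores, defaulted), 3))  # 0.933
```

The pairwise loop is quadratic in the number of observations. It is meant to make the definition visible, not to be efficient on large portfolios.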
Another important aspect besides the statistical model is the data that is fed into
the model. The first credit scores relied only on credit bureau data like credit history,
address and the kinds of bank accounts used.9 More recent models started to
include transactional data like changes in income or account balance, channels of
payment (cash, bank transfer, credit card) and categories of payment (housing,
cost of living, entertainment, credits). This information is only available if the lender
is able to get information from the applicant's main bank account(s). For most banks
this is the case, because customers often apply for credit at the bank where they have
their other accounts.10 Today some companies have started to include additional (big)
data sources like social media websites or the way an online application is filled in.
While credit bureau data is very relevant for the problem at hand, it is also often
outdated and incomplete. Transactional data is more up to date and also very
relevant but is only accessible to very few institutions. Big data is the most current,
most widely accessible and also largest source of data. On the other hand, only
some of it is relevant to the creditworthiness of the applicant and it is sometimes
incorrect.

7 Modified from Lorenz 2012, 21
8 Kraus 2014, 1-3
9 Lang et al. 2014, 30-32
10 Kallerhoff 2013, 13-19

3. Big data in credit scores


3.1. Benefits

As explained above, the financial industry already utilizes statistical analyses of data
as a basis for decisions. Implementing big data into that would not be a completely
new approach but an extension of the existing models. However, big data based
models could provide a way to analyze larger pools of data than any method
employed before.11
This is especially important since the digitalization of the finance industry is picking
up pace. Relationships between consumers and banks are becoming less intensive and
more anonymous. Consumers use more and more online solutions while staying away
from classical bank branches. This makes it even more difficult for banks to
make good credit decisions. In the past the bank clerk often had a long personal
relationship with the consumer and could include personal impressions and experiences
in the decision.12 By including information from social networks and other online
sources, the personal aspect can be substituted by the scoring algorithm to some
degree. The number of credit decisions that have to be taken by consumer lenders
also makes automation attractive from a process point of view. Reducing the
direct involvement of personnel reduces labor costs, leads to more consistent
decisions and provides a faster service to the customers.13

11 BITKOM 2015a, 37
12 Lang et al. 2014, 26-27; Parker and Wolkowitz 2015, 3-6
13 Khandani, Kim and Lo 2010, 2767-2770; Kraus 2014, 1-3

The next question to answer is which information to use. Only testing will show
which data points provide a benefit to the model, but it is possible to name
categories of information that can be included. An often cited example is social
network data. This includes the number and types of accounts someone has
created, the size and composition of his network or his quantitative and qualitative
activity. Most networks offer the option to support or "like" certain groups, events,
companies, news articles or statements. While it is very difficult to automatically
analyze and evaluate specific written comments because they have no common
structure, it is possible to analyze likes because of their binary nature.14 As long as
the credit application is done online, the device and setup of the applicant can be used
as a source of information as well. This starts with the type of device (desktop, tablet,
mobile), the manufacturer (Apple, Samsung etc.) and the connection speed and goes on to
installed programs (which can only be observed indirectly; for an example see the
last paragraph of this chapter).15 Another important source of information is the
application process itself. If this is done via an online form, it is possible to track the
time of day of the application, the time needed to complete the application or
individual questions and how often entries are deleted or revised. This information is
especially relevant in finding out whether the applicant is a real person or just a
computer program created to commit fraud with faked profiles.16 Other lenders use
optional online tutorials and quizzes on how to manage loans and track the
applicant's willingness to follow through with these.17 Most likely none of these data
points alone would be able to predict payment behavior, but together they reveal a
lot about the loan applicant.

14 Sadowski and Taylor 2015; Graepel, Kosinski and Stillwell 2013, 5802-5804
15 Mller 2016
16 Mller 2016; Gutierrez 2014, 7-11
17 Parker and Wolkowitz 2015, 11-13

To answer the question of whether models using big data are more accurate than
others, it would be best to apply both models to the same set of loan applicants,
then actually give the loans and see which model's predictions were better. Sadly,
such studies have not been done so far (for more information see chapter 6. Limits
of the Paper). Another approach is to look at the statistical models employed and
how they change when more data is provided to them. In general, more data
contains more information, and with more information more accurate predictions are
possible. But for practical application this statement is far too simple. First of all,
new data doesn't always include new information. Sometimes the information
included in the new data is also included in other data that is already part of the
model. This effect is called collinearity and happens if two independent variables
are highly correlated. If the new data were added regardless, it would
increase the complexity of the model without making it more accurate. While new
information does indeed make a model more accurate, it also has a downside
because it reduces the generalizability of the model. This means that the model would
describe the training data it is built upon better than before, but when
applied to new observations (in this case new loan applicants) the accuracy would
decrease. The reason for this is statistical noise, which describes rare or unlikely
events that affect very few observations and are different for each set of
observations. This problem can be overcome by increasing the number of
observations used to build the model. The statistical noise would still be there but
would have a smaller effect since it can be identified as an outlier. Usually the
number of observations is a strong limiting factor for statistical analysis, but in the
case of consumer lending, lenders have access to vast amounts of observations. At
some point another factor becomes relevant: computing power. A more
complex model requires more computing power for its analyses. It can be
concluded that new sources of information gathered from big data are
able to increase the accuracy of credit scoring models. To keep the generalizability
of the model intact, it is important to also increase the number of observations along
with the number of independent variables.18
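
A simple way to spot the collinearity described above is to compute the pairwise correlation between candidate variables before adding them to the model. The sketch below does this with the Pearson correlation coefficient. The variable names, the sample values and the cut-off of 0.9 are assumptions made for illustration only:

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equally long numeric lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def collinear_pairs(features, limit=0.9):
    """Return all pairs of variables whose absolute correlation reaches `limit`."""
    names = sorted(features)
    return [
        (a, b)
        for i, a in enumerate(names)
        for b in names[i + 1:]
        if abs(pearson(features[a], features[b])) >= limit
    ]


# Hypothetical applicant data: annual income is just twelve times monthly income.
features = {
    "monthly_income": [1800, 2500, 3200, 4100, 5000],
    "annual_income": [21600, 30000, 38400, 49200, 60000],
    "form_fill_seconds": [310, 95, 240, 120, 400],
}

print(collinear_pairs(features))  # [('annual_income', 'monthly_income')]
```

In this made-up data set, `annual_income` carries no information beyond `monthly_income`, so adding it would only make the model more complex. `form_fill_seconds` is only weakly correlated with income and could contribute new information.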
Regression models are often criticized because they rely on correlation between
dependent and independent variables without asking for the causal relation. If this
were a study on how to improve the creditworthiness of consumers, it would be
important to search for causality between independent and dependent variables.
Then one could figure out the causes of low creditworthiness and design methods to
fight them. For credit scores, however, correlation is a sufficient indicator, regardless
of whether it is a spurious correlation (for example one caused by a third, unknown
variable that influences both) or a direct one.19 A prime example
of this is the "casino font": Kreditech found out that if loan applicants had a certain
font installed, their likelihood of paying back the loan was significantly lower. At
first glance there is no causality that could explain this. After some research they
found out that this specific font is used by online casino software and is
automatically installed when using such software.20 After finding this third variable it
was easy to make the connection between online gambling and lower credit scores,
but even if this relation had never been found, the scoring algorithm would
still have the same explanatory power.

18 Esposito et al. 1999, 277-299; Hill and Lewicki 2006, 81-94; Kelley and Maxwell 2008, 306-318
19 Hill and Lewicki 2006, 81-90
20 Seibel 2015

3.2. Challenges
When the German market leader for consumer credit scores, the
Schutzgemeinschaft für allgemeine Kreditsicherung (Schufa), publicly thought about
incorporating big data into its scores, the outcry was huge. It only took a few days
of media attention for its research partner, the Hasso Plattner Institut, to terminate
the research contract and a few more days until the Schufa decided not to pursue
this idea further for the time being.21 But why did this happen? Why is public opinion
so strongly against using big data in the context of credit scores?
First of all, people are skeptical of the new technology because they don't
understand it and fear it might have negative consequences for them. To counter
this misunderstanding, transparency should be the first choice. It has many aspects,
some of which can easily be implemented while others lead to conflicts of interest.
First, the benefits for the companies using the big data approaches, but also for
the consumers, have to be explained. This will help in gauging the intentions behind
the process. The benefits for the companies have been explained above, but
consumers also profit from more accurate credit scores. No consumer would want to take
on a loan that he is not able to repay, and the reduced costs for lenders will lead to
reduced prices (interest rates) as long as there is sufficient competition.22
The next transparency aspect is the designated purpose of the data. Lenders should
clearly commit themselves to using the information only to evaluate the loan
application and not to share the base data with third parties. This would take away a lot
of uncertainty regarding the use of and access to the data.23

21 Borchers 2012; Hasso Plattner Institut 2012; Hornung and Webermann 2012
22 BITKOM 2015a, 65-71
23 Ibid

Another step to gain acceptance from consumers would be to create an opt-in
procedure. This comes with a downside for the company using the model because it
would either lose all the customers who don't choose to opt in or would have to use
another scoring model for those who decide against the big data approach. By
excluding customers who reject the approach, the whole opt-in option becomes a
farce because the consumer has no real choice; he either accepts the terms or
can't use the service (get a loan). Still doing business with consumers who
didn't opt in could lead to a situation in which loan applicants with better
creditworthiness are willing to give access to additional data while those with lower
creditworthiness decline. This might lead to an adverse selection process which
pools high-risk consumers in the old system, which would increase the risk premium
for everyone who is not willing to give access to his data. Again the consumers
would basically be forced to accept the new system.24
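
The adverse selection effect can be made concrete with a small calculation of the average default rate in the pool that stays with the old scoring system. All group sizes and default rates below are invented for illustration:

```python
def pool_default_rate(groups):
    """Average default rate of a pool.

    groups is a list of (number_of_borrowers, default_rate) pairs.
    """
    total = sum(n for n, _ in groups)
    return sum(n * rate for n, rate in groups) / total


# Hypothetical starting point: 800 low-risk borrowers (2% default rate)
# and 200 high-risk borrowers (10% default rate) share the old system.
before = pool_default_rate([(800, 0.02), (200, 0.10)])

# Assume 600 of the low-risk borrowers opt in to the new model and leave the pool.
after = pool_default_rate([(200, 0.02), (200, 0.10)])

print(round(before, 3))  # 0.036
print(round(after, 3))   # 0.06
```

Under these assumptions the default rate of the remaining pool rises from 3.6% to 6%. A lender who prices for risk would have to raise the premium for everyone who has not opted in, including the remaining low-risk borrowers.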
Consumers are also interested in knowing exactly which data is used. Fulfilling this
wish can be technically difficult because of the sheer amount of data. Some
companies claim to use up to 20,000 individual data points.25 But it would be
possible to communicate the categories of data and give some examples of
individual data points as done in chapter 3.1. This leads to the next problem, which is
data accuracy. People usually don't check if all the data that can be linked to them
online is correct. If they are not aware which data is used for credit scoring, they
have no chance to notice if incorrect or incorrectly connected data is used. Not only
the consumer but also the company employing the model has an interest in using
correct data, because only that can lead to correct predictions of the model.26

The biggest step in terms of transparency would be to disclose how exactly the
scoring model works. However, this comes with a multitude of problems. Multivariate
regression analyses are quite complex and not easily understood by consumers
who don't have the necessary statistical background. This is also a problem when a
customer wants to know why his loan application was declined. When using
nonlinear models there is no simple answer to that question. But even if lenders
were able to explain the models to consumers, they wouldn't want to do so. If
consumers understood how credit scoring models work, they would be able to
influence the scores, not by becoming more financially stable but by gaming the
system. For example, if a loan applicant knew that he would get a better credit
score if the application is done from an Apple tablet, he might borrow one from a
friend even if he doesn't own one himself.27 And there is a third aspect that prevents
transparency in this area. The scoring model is an extremely important business
secret for a lender, and if it were disclosed in a transparent manner, competitors
would be able to copy it or improve their own systems without putting in the research
effort.28

24 Polonetsky and Tene 2012, 63-68
25 Kreditech 2016a
26 BITKOM 2015a, 66-68
27 Hiß and Róna-Tas 2008, 13-18
28 BITKOM 2015a, 71-78

To protect the privacy of the consumers concerned, many big data technologies
utilize anonymization, even though that is very difficult to begin with. Even if all
personal data points like name, address etc. are deleted, the huge amount of other
data points is so unique that it would often be possible to trace it back to the specific
person.29 In this case the personalization of the data sets is also very important.
When a score is calculated it has to be allocated to someone, which is not possible
if it is done anonymously. But even after the score is calculated it is important to
compare the real repayment behavior to the predicted one to change the model if
necessary. This can only happen if the original dataset can be connected to the loan
or the person paying the loan.
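
Why deleting names and addresses is not enough can be shown with a short sketch that measures how many records become unique once several seemingly harmless attributes are combined. The records and attribute names are hypothetical:

```python
from collections import Counter


def unique_share(records, attributes):
    """Share of records whose combination of `attributes` occurs only once."""
    keys = [tuple(record[a] for a in attributes) for record in records]
    counts = Counter(keys)
    return sum(1 for key in keys if counts[key] == 1) / len(keys)


# Hypothetical "anonymized" application data without names or addresses.
records = [
    {"device": "tablet", "browser": "Safari", "hour": 9},
    {"device": "tablet", "browser": "Safari", "hour": 22},
    {"device": "desktop", "browser": "Firefox", "hour": 9},
    {"device": "desktop", "browser": "Chrome", "hour": 9},
    {"device": "mobile", "browser": "Chrome", "hour": 22},
    {"device": "desktop", "browser": "Firefox", "hour": 14},
]

print(round(unique_share(records, ["device"]), 2))             # 0.17
print(round(unique_share(records, ["device", "browser"]), 2))  # 0.33
print(unique_share(records, ["device", "browser", "hour"]))    # 1.0
```

In this toy data set three attributes already single out every record. With thousands of data points per applicant, the combination of values acts like a fingerprint even when no directly identifying field is stored.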
Some of the distrust also arises because this is still a young phenomenon and
no social standards have developed on how to handle it. Consumers are just
learning that they can (and sometimes have to) pay for services with their data.
Examples of this are social networks or the various Google apps. While the service
is free for the user, he still provides a benefit for the company in the form of data
(and as a marketing target).30
When talking about trust the security of the actual data has to be addressed. Big
stores of data are valuable and thus there will be individuals or organizations who
might try to steal them or gain access to them. Banks are already very versed in this
topic because they have to protect their system of accounts and transactions. Other
parties who only focus on the lending or offer scoring-as-a-service will have to put
additional resources into the protection of their IT infrastructure.

29 Graepel, Kosinski and Stillwell 2013, 5802-5805
30 Borgmeier 2013

All security and privacy concerns aside, there is one big problem left. The model has
to be fed with the data of loan applicants at the date of application and their later
repayment behavior. While the latter is widely available to lenders, few if any of
them have recorded the data points that are to be used in new big data models at
the time of application. This means there is no data to build the model on. Lenders
now have the option either to run a new model in parallel to their old one just to
collect data, continuing to base decisions on the old one for the transition period, or
to collect the data for the new model and give loans to all applicants to see what
happens. Both ways are very costly, especially the second one. Entrenched players
like big banks fear these substantial costs and want to wait until there is a model
that they can use from day one.31 This leads to a chicken-and-egg problem, where the
model needs access to a huge number of customers to become accurate but will
only gain said access once the accuracy is proven. The implementation is further
hindered by the uncertain legal situation regarding data protection. Most laws
concerning data protection stem from a time when big data was only science fiction.
This means that many questions arising today are not at all or not completely
regulated by these laws. This creates uncertainty for the companies dealing with big
data because they cannot be sure if their practices will be permissible in the future,
which makes it difficult to justify huge investments.32

4. Example business: Kreditech


Kreditech is a fintech startup based in Hamburg, Germany. It was founded in 2012 and
by 2016 it has 300 employees and has raised 314 billion Euros in equity and
debt. Its main markets are Poland, Spain, Czech Republic, Russia and Mexico. It
uses a scoring model similar to those described in chapter 3.1 and mainly targets
so-called underbanked or badly banked consumers who cannot be scored by
traditional scoring systems since no historical data is available for them. According
to Kreditech this applies to 73% of the world's population.33
It started its business by offering scoring-as-a-service to other lenders but faced the
implementation problem described in chapter 3.2. Kreditech then adapted its
business model and also started providing loans to consumers itself. This enabled it
to generate revenues more quickly as well as to enhance and prove the scoring
model. According to Kreditech their consumer lending business units (in regional
subsidiaries) became profitable in each market after just a few months of adapting
the scoring model to the distinctive local characteristics. Kreditech also had to deal
with another challenge described in chapter 3.2: In 2015 someone was able to steal
some of the customer data and tried to blackmail the company with it. The
management kept calm, worked with the police and argued that it could only have
been an inside job of a former employee, because of their high security standards.34

31 Wack 2014
32 BITKOM 2015a, 88-91
33 BITKOM 2015b, 80-83; Kreditech 2016b
34 BITKOM 2015b, 80-83; de Souza Soares 2015; Kreditech 2013

5. Conclusion
To be able to make use of the benefits of big data scoring models, the challenges
have to be dealt with. Since every company and market is different there can be no
universal approach that fits everyone, but a general roadmap could look like the
following:
The first step should be to generate as much transparency about the model and the
process as possible without disclosing critical details. This includes the reason why
such an approach is used and how the company and the consumer can benefit from
it. It should then be made clear what type of data is used, perhaps with some
examples for each category, but without disclosing all data points that are included
in the model. The consumer should also be informed what the data is used for, either
only the credit scoring or also other purposes. As explained in chapter 3.2, the
transparency should not go so far as to disclose the exact workings of the model.
Anonymization and opt-in / opt-out models could increase consumer acceptance
but would bring too many downsides for the lending company to actually implement
them. The implementation problem requires a different approach. If a lender
develops his own model, implementation should not be a big issue. The decision for
the approach has already been made and there is access to at least some past
customer data. It would be advisable to run the new model in the background for a
few months to let it gain accuracy with real data and at some point switch to the new
one. For a scoring-as-a-service provider, one approach could be the path Kreditech
took, which was described in chapter 4. They started giving out loans themselves to
enhance and prove the model and only then sold it as a service. The downside of
this approach is that it is very capital intensive. Another alternative would be to cover
the implementation costs for the lender and then charge higher recurring fees.
These could be tied to the increase in profits for the lender to show even more
commitment as the service provider. Again this way is very capital intensive.
If lenders are willing to implement it and consumers are willing to work with it, the
scoring model could be used in almost any situation where the creditworthiness of a
consumer has to be gauged. While it would also be helpful to use it in bank
branches or at retailers who offer financing services for their products, the greatest
benefit could be generated when using it for online loan applications. This way even
more data points can be generated (user input behavior and device) and a lot of
process steps can be automated. This automation will allow the lender to save
costs, which will further increase profits.

6. Limits of the Paper


The main limitation is the lack of empirical research on the topic. The companies
employing big data models for credit scoring claim that their models are superior,
and some even refer to studies that prove this. However, these
studies are not published, so their reliability remains questionable. Conducting an
original study would go far beyond the scope of this paper, and even attempting one
poses several problems:
Innovative companies using big data models for credit scoring have no interest in
publishing exactly how their algorithms work. The algorithms are the trade secret with
which they earn their money, and no one can force them to reveal it. They do conduct their
own testing, but with the goal of promoting their services, so the findings are very likely
to be biased and cannot simply be trusted. Even with access to the algorithms,
one would also need the data of a relevant number of loan applicants,
their later repayment behavior and the algorithms of traditional models for
comparison. These are also difficult to come by, since banks and other lenders are
just as reluctant to reveal their customers' data. In some countries they are even
explicitly prohibited from doing so by data protection laws.
The legal situation also plays a major role in big data topics. Since the field is still
quite young, there are no established legal procedures. As described in chapter 3.2,
laws concerning data protection have yet to recognize the possibilities of today's
information technology. In addition, the legal situation differs from country to
country, which further complicates matters for companies in this field of business. An
analysis of the legal situation in each country would again go far beyond the scope
of this paper. Nevertheless, every company that considers using the technology
should examine the legal boundaries in the markets
it is going to enter. This makes it difficult to give one universal recommendation
on how to approach the topic.

Honorary declaration
I hereby declare that I
1. wrote this term paper without the assistance of others;
2. have marked direct quotes used from the literature and the use of ideas of other
authors at the corresponding locations in the paper;
3. have not presented this paper for any other exam.
I acknowledge that a false declaration will have legal consequences.
Tim Decker
