
WHERE TO FIND DATA
FOR YOUR DATA SCIENCE PROJECTS

SOME IDEAS FOR FINDING DATA ONLINE
KAGGLE.COM

Companies post a dataset and a question, and usually offer a prize for the best answer.

Kaggle also has discussion forums and “kernels” in which people share their code, so you can learn how others approached the dataset.

As a result, Kaggle has thousands of datasets with accompanying questions and examples of how other people analyzed them.

The biggest benefit of Kaggle is also its biggest drawback: by handing you a (generally cleaned) dataset and problem, it's done a lot of the work for you. You also have thousands of people tackling the same problem, so it's difficult to make a unique contribution.

One way to use Kaggle is to take a dataset but pose a different question or do an exploratory analysis.

But generally, we think that Kaggle is best for learning by tackling a project and then seeing how you performed compared with others, thus learning from what their models did, rather than as a piece of your portfolio.

THE NEWS

Recently, many news companies have started making their data public.

FiveThirtyEight.com, for example, a website that focuses on opinion-poll analysis, politics, economics, and sports blogging, publishes the data it uses for articles and even links to the raw data directly from the article pages.

Although these datasets often require manual cleaning, the fact that they're in the news means that an obvious question is probably associated with them.

“IN GOD WE TRUST, ALL OTHERS BRING DATA.”
— W. EDWARDS DEMING
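The exploratory-analysis route mentioned above usually starts with a quick summary pass over the downloaded CSV. Here is a minimal sketch using only the Python standard library; the column name and sample values are hypothetical stand-ins for whichever Kaggle dataset you pick:

```python
import csv
import statistics
from io import StringIO

def summarize_column(csv_text, column):
    """Compute basic summary statistics for one numeric column of a CSV."""
    rows = csv.DictReader(StringIO(csv_text))
    values = [float(r[column]) for r in rows if r[column] != ""]
    return {
        "n": len(values),
        "mean": statistics.mean(values),
        "stdev": statistics.stdev(values) if len(values) > 1 else 0.0,
        "min": min(values),
        "max": max(values),
    }

# Tiny inline sample standing in for a downloaded Kaggle dataset.
sample = "price,rooms\n100,2\n250,4\n175,3\n"
print(summarize_column(sample, "price"))
```

In practice you would read the file with `open(...)` instead of `StringIO`, but the summarizing step is the same.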
OPEN DATA PORTALS

A lot of government data is available online.

You can use census data, employment data, the General Social Survey, and tons of local government data, such as New York City's 911 calls or traffic counts.

Sometimes you can download this data directly as a CSV file; at other times, you need to use an API.

You can even submit Freedom of Information Act requests to government agencies to get data that isn't publicly listed.

Government information is great because it's often detailed and deals with unusual subjects, such as data on the registered pet names of every animal in Elk Grove, California.

The downside of government information (or potential upside, as it poses a great challenge) is that it often isn't well formatted, such as tables stored within PDF files.

APIS

APIs (application programming interfaces) are developer tools that allow you to access data directly from companies.

You know how you can type in a URL and get to a website? APIs are like URLs, but instead of a website, you get data.

Some examples of companies with helpful APIs are The New York Times, Yelp, Spotify, Netflix, and The Weather Channel.

Some APIs even have R or Python packages that make it easier to work with them. rtweet for R, for example, lets you pull Twitter data quickly so that you can find tweets with a specific hashtag, what the trending topics in Sacramento are, or what tweets Naval Ravikant is favoriting.

Keep in mind that there are limitations and terms of service governing how you can use these APIs.

APIs are great for providing extremely robust, organized data from many sources.

"YOU CAN HAVE DATA WITHOUT INFORMATION, BUT YOU CANNOT HAVE INFORMATION WITHOUT DATA."
— DANIEL KEYS MORAN
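The "URL that returns data" idea is easy to see in code. Here is a sketch of assembling an API request in Python; the endpoint, path, and api-key parameter are hypothetical, shaped like a typical news-search API rather than any specific company's:

```python
from urllib.parse import urlencode

def build_api_url(base, path, params):
    """Assemble a query URL: the 'URL that returns data' behind most APIs."""
    return f"{base}{path}?{urlencode(params)}"

# Hypothetical endpoint and key, shaped like a typical news-search API.
url = build_api_url(
    "https://api.example.com",
    "/v1/articles",
    {"q": "open data", "page": 1, "api-key": "YOUR_KEY"},
)
print(url)

# The actual call is just an HTTP GET that returns JSON, e.g.:
# with urllib.request.urlopen(url) as resp:
#     data = json.load(resp)
```

Wrapper packages like rtweet mostly automate exactly this step: building the URL, attaching your credentials, and parsing the response.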
YOUR OWN DATA

There are many places where you can download data about yourself; social media websites and email services are two big ones.

But if you use apps to keep track of your physical activity, reading list, budget, sleep, or anything else, you can usually download that data as well.

Maybe you could build a chatbot based on your emails with your colleagues or friends. Or you could look at the most common words you use in your tweets and how those words have changed over time.

Perhaps you could track your caffeine intake and exercise for a month to see whether you can predict how much and how well you sleep.

The advantage of using your own data is that your project is guaranteed to be unique: no one else will have looked at that data before!

WEB SCRAPING

Web scraping is a way to extract data from websites that don't have an API, essentially by automating visiting web pages and copying the data.

You could create a program to search a movie website for a list of 100 actors, load their actor profiles, copy the lists of movies they're in, and put that data in a spreadsheet.

You do have to be careful, though: scraping a website can be against the website's terms of use, and you could be banned. You can check a website's robots.txt file to find out what is allowed.

You also want to be nice to websites: if you hit a site too many times, you can bring it down.

But assuming that the terms of service allow it and you build in time between your hits, scraping can be a great way to get unique data.

“IF WE HAVE DATA, LET’S LOOK AT DATA. IF ALL WE HAVE ARE OPINIONS, LET’S GO WITH MINE.”
— JIM BARKSDALE
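The robots.txt check described above can be automated with Python's standard-library robot parser. A sketch, with a made-up robots.txt body standing in for one fetched from the target site:

```python
from urllib.robotparser import RobotFileParser

def make_checker(robots_txt, agent="my-scraper"):
    """Parse a robots.txt body and return a can_fetch(url) function."""
    rp = RobotFileParser()
    rp.parse(robots_txt.splitlines())
    return lambda url: rp.can_fetch(agent, url)

# Sample robots.txt like one you might fetch from example.com/robots.txt
robots = """User-agent: *
Disallow: /private/
"""
allowed = make_checker(robots)
print(allowed("https://example.com/actors/list"))   # allowed
print(allowed("https://example.com/private/data"))  # disallowed

# Being nice: pause between requests so you don't hammer the site, e.g.
# for url in pages_to_visit:
#     if allowed(url):
#         html = fetch(url)  # your HTTP call here
#     time.sleep(1.0)
```

`fetch` and `pages_to_visit` in the comment are placeholders for your own download loop; the parser and the pause are the parts that keep you within the rules.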
OPENML

OpenML.org is an open science platform for machine learning that holds open data, open algorithms, and tasks.

One of the core components of OpenML is datasets. People can upload their datasets, and the system automatically organizes them online.

Each dataset has its own unique ID. Information about the dataset, its features, and its qualities can be obtained automatically by means of API functions, or downloaded manually as a CSV file.

Every dataset gets a dedicated page with all known information, including a wiki, visualizations, statistics, user discussions, and the tasks in which it is used.

At the time of this writing, OpenML has nearly 22,000 datasets available!

UC IRVINE

The UCI Machine Learning Repository is a collection of databases, domain theories, and data generators that are used by the machine learning community for the empirical analysis of machine learning algorithms.

The archive was created as an FTP archive in 1987 by David Aha and fellow graduate students at UC Irvine.

Since that time, it has been widely used by students, educators, and researchers all over the world as a primary source of machine learning data sets.

As an indication of the archive's impact, it has been cited over 1,000 times, making it one of the top 100 most cited "papers" in all of computer science.

The current version of the website was designed in 2007 by Arthur Asuncion and David Newman.

Visit it here: https://archive.ics.uci.edu/ml/datasets.php

“ABOVE ALL ELSE, SHOW THE DATA.”
— EDWARD R. TUFTE
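Many repository files, such as UCI's classic iris.data, are header-less CSVs, so you attach column names yourself when loading. A small standard-library sketch; the two sample rows follow the iris attribute order documented by the repository:

```python
import csv
from io import StringIO

def load_headerless_csv(text, names):
    """Parse a header-less CSV body into a list of dicts with given names."""
    reader = csv.reader(StringIO(text))
    return [dict(zip(names, row)) for row in reader if row]

# Two rows in the format of UCI's iris.data file.
sample = "5.1,3.5,1.4,0.2,Iris-setosa\n4.9,3.0,1.4,0.2,Iris-setosa\n"
names = ["sepal_len", "sepal_wid", "petal_len", "petal_wid", "species"]
rows = load_headerless_csv(sample, names)
print(rows[0]["species"])  # Iris-setosa
```

OpenML's CSV downloads include headers, so for those `csv.DictReader` alone is enough; the manual-naming step is mainly for UCI-style files.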
PROPUBLICA

ProPublica is an independent, nonprofit newsroom that produces investigative journalism with moral force.

The ProPublica Data Store gives you access to the data behind its reporting and helps to sustain the challenging, expensive work of investigative reporting.

ProPublica provides free access to the raw data behind its work, as well as premium data products and custom data services. These and other initiatives support its mission of investigative journalism in the public interest.

Visit propublica.org/datastore to browse data sets about Health, Criminal Justice, Education, Politics, Business, Transportation, Military, Environment, Finance, or Religion.

GOOGLE DATASET SEARCH

You can get to Google's Dataset Search directly by visiting: datasetsearch.research.google.com

Similar to how Google Scholar works, Dataset Search lets you find datasets wherever they're hosted, whether it's a publisher's site, a digital library, or an author's personal web page.

Dataset Search enables users to find datasets stored across the Web through a simple keyword search.

The tool surfaces information about datasets hosted in thousands of repositories across the Web, making these datasets universally accessible and useful.

“ERRORS USING INADEQUATE DATA ARE MUCH LESS THAN THOSE USING NO DATA AT ALL.”
— CHARLES BABBAGE
OPEN DATA STACK EXCHANGE

Open Data Stack Exchange is a question and answer site for developers and researchers interested in open data.

It's built and run by the community as part of the Stack Exchange network of Q&A sites.

With the help of the research and data community, they're working together to build a library of detailed answers to every question about open data.

The site is all about getting answers. It's not a discussion forum.

Good answers are voted up and rise to the top. The best answers show up first so that they are always easy to find.

At this time the site is in beta, but it is still very useful for finding data, or for asking whether anyone knows where the type of data you're looking for can be found.

Visit opendata.stackexchange.com to check it out.

REDDIT DATASETS

Reddit's /r/datasets is a place to share, find, and discuss datasets.

Users have posted an eclectic mix of datasets about gun ownership, NYPD crime rates, college student study habits, and caffeine concentrations in popular beverages.

You're sure to find awesome data here. Visit reddit.com/r/datasets to learn more.

“WITHOUT A SYSTEMATIC WAY TO START AND KEEP DATA CLEAN, BAD DATA WILL HAPPEN.”
— DONATO DIORIO
AWS OPEN DATA

Amazon makes large data sets available on its Amazon Web Services platform: aws.amazon.com/opendata

You can download the data and work with it on your own computer, or analyze it in the cloud using EC2 and Hadoop via EMR.

Amazon has a page that lists all of the data sets for you to browse, along with details on how the program works.

You'll need an AWS account, although Amazon gives new accounts a free access tier that will enable you to explore the data without being charged.

Here are some examples:

- Lists of n-grams from Google Books — common words and groups of words from a huge set of books.
- Common Crawl Corpus — data from a crawl of over 5 billion web pages.
- Landsat images — moderate-resolution satellite images of the surface of the Earth.

ACADEMIC TORRENTS

Academic Torrents is designed to facilitate storage of all the data used in research, including datasets as well as publications.

It's a distributed system for sharing enormous datasets - for researchers, by researchers.

The result is a scalable, secure, and fault-tolerant repository for data, with blazing fast download speeds.

Check out academictorrents.com to see all they have to offer, as well as documentation about their API.

“WITH DATA COLLECTION, ‘THE SOONER THE BETTER’ IS ALWAYS THE BEST ANSWER.”
— MARISSA MAYER
WIKIPEDIA

As part of Wikipedia's commitment to advancing knowledge, it offers all of its content for free and regularly generates dumps of all the articles on the site.

Additionally, Wikipedia offers edit history and activity data, so you can track how a page on a topic evolves over time, and who contributes to it.

You can find the various ways to download the data on the Wikipedia site. You'll also find scripts to reformat the data in various ways.

Here are some examples:

- All images and other media from Wikipedia — all the images and other media files on Wikipedia.
- Full site dumps — all of the content on Wikipedia, in various formats.

BIGQUERY PUBLIC DATA

Much like Amazon, Google also has a cloud hosting service, called Google Cloud Platform.

With GCP, you can use a tool called BigQuery to explore large data sets.

Visit here to learn more: cloud.google.com/bigquery/public-data

Here are some examples:

- USA Names — contains all Social Security name applications in the US, from 1879 to 2015.
- Github Activity — contains all public activity on over 2.8 million public Github repositories.
- Historical Weather — data from 9,000 NOAA weather stations, from 1929 to 2016.

“WE’RE ENTERING A NEW WORLD IN WHICH DATA MAY BE MORE IMPORTANT THAN SOFTWARE.”
— TIM O’REILLY
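Exploring a BigQuery public table is mostly a matter of writing standard SQL. A sketch against the USA Names table; the exact table ID shown is an assumption to verify in the BigQuery console, and running the query requires a GCP project plus the google-cloud-bigquery package:

```python
# Standard SQL against the public USA Names table; the table ID is an
# assumption to check against the console's public datasets list.
query = """
SELECT name, SUM(number) AS total
FROM `bigquery-public-data.usa_names.usa_1910_current`
WHERE state = 'TX' AND gender = 'F'
GROUP BY name
ORDER BY total DESC
LIMIT 10
"""

# Running it (requires credentials, so shown as a comment here):
# from google.cloud import bigquery
# client = bigquery.Client(project="your-project-id")
# for row in client.query(query):
#     print(row.name, row.total)
print(query.strip())
```

The same pattern (one SQL string, one client call) works for the GitHub and weather tables listed above; only the table ID and columns change.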
QUANDL

Quandl is a repository of economic and financial data.

Some of this information is free, but many data sets require purchase.

Quandl is useful for building models to predict economic indicators or stock prices.

Due to the large number of available data sets, it's possible to build a complex model that uses many data sets to predict values in another.

Visit quandl.com/search to browse available datasets.

Here are some examples you might find:

- Entrepreneurial activity by race and other factors — contains data from the Kauffman Foundation on entrepreneurs in the US.
- Chinese macroeconomic data — indicators of Chinese economic health.
- US Federal Reserve data — US economic indicators, from the Federal Reserve.

DATA.GOV

Data.gov makes it possible to download data from multiple US government agencies.

Data can range from government budgets to school performance scores.

Much of the data requires additional research, and it can sometimes be hard to figure out which data set is the “correct” version.

Anyone can download the data, although some data sets require jumping through additional hoops, like agreeing to licensing agreements.

Here are some examples:

- Food Environment Atlas — contains data on how local food choices affect diet in the US.
- School system finances — a survey of the finances of school systems in the US.
- Chronic disease data — data on chronic disease indicators in areas across the US.

“THINK ANALYTICALLY, RIGOROUSLY, AND SYSTEMATICALLY ABOUT A BUSINESS PROBLEM AND COME UP WITH A SOLUTION THAT LEVERAGES THE AVAILABLE DATA.”
— MICHAEL O’CONNELL
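Predicting an indicator from a downloaded series can start as simply as fitting a least-squares trend line. A pure-Python sketch; the quarterly values are made-up placeholders for a series you might pull from Quandl or Data.gov:

```python
def fit_trend(ys):
    """Ordinary least-squares line y = a + b*x over x = 0, 1, 2, ..."""
    n = len(ys)
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    b = cov / var
    a = mean_y - b * mean_x
    return a, b

# Placeholder quarterly indicator values; forecast the next quarter.
series = [2.1, 2.3, 2.2, 2.6, 2.7]
a, b = fit_trend(series)
forecast = a + b * len(series)
print(round(forecast, 2))  # 2.83
```

A real model would use many series and a proper library, but the shape is the same: fit on the values you have, then evaluate the fitted function one step ahead.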
