
Python For Data Science

The document provides an overview of a Python for Data Science course presented at the 2017 Summer School by experts from Sciences Po and CNRS. It covers key topics such as Python programming, Jupyter notebooks, web scraping using BeautifulSoup, and data visualization with Pandas. The course aims to dispel myths about coding and equip participants with practical skills for data analysis and visualization.


Python for data science

Alexandre Chevallier, Jérémy Richard, Geneviève Michaud, Baptiste Rouxel

To cite this version:


Alexandre Chevallier, Jérémy Richard, Geneviève Michaud, Baptiste Rouxel. Python for data science.
Doctoral. Summer School 2017, Trivedi Centre for Political Data (TCPD) Ashoka University, India.
2017. ⟨hal-03923285⟩

HAL Id: hal-03923285


https://sciencespo.hal.science/hal-03923285v1
Submitted on 4 Jan 2023

HAL is a multi-disciplinary open access archive for the deposit and dissemination of scientific research documents, whether they are published or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers.

Distributed under a Creative Commons Attribution 4.0 International License


Python for data science
#1 Introduction to Python and Jupyter
Trivedi Center for Political Data (TCPD) 2017 Summer School,
Technical module, July 11th 2017

Alexandre Chevallier (CDSP, SciencesPo and CNRS)


Jérémy Richard (SciencesPo)

Creative Commons Attribution 4.0 International (CC BY 4.0)


Introduction
Outline
Adopt Python

Jupyter

Who are we?
Jérémy Richard, Sysadmin engineer.

Joined Sciences Po in 2012.

Missions: Building IT infrastructures for two laboratories of the scientific department:

● CDSP - Center For Socio-Political Data


● Médialab

Who are we?
Alexandre Chevallier, Web Developer.

Joined the CDSP of Sciences Po in 2014

Missions: Develop tools/applications for several projects, among which:


● ELIPSS project (Internet online panel for the Social Sciences)
● Bequali (Qualitative data bank)
● support for CDSP data and metadata related activities

Who are we?
- CDSP - Center for Socio-Political Data

- CNRS and Sciences Po mixed service unit for social science researchers

- Helps researchers with their data and metadata: collects, curates and disseminates both

- Involved in several projects such as:

  - ELIPSS (Digital panel for quantitative surveys)

  - beQuali (Archive and share qualitative studies)

  - Vizlab (Visualization for French election data)

Why are we here?
Dispelling a few myths

● Myth 01: I have to go to university to learn how to code.

● Myth 02: I need to be a genius to code.

Share IT knowledge

Help you decide whether our tools suit your needs

Adopt Python - Brief
- Created by Guido van Rossum in the Netherlands in the early 1990s

- Python is a general-purpose, open source programming language

- Often used as a scripting tool

- It is usually described as an interpreted language (≠ compiled language)

- A "Swiss Army knife" language

Adopt Python - Scope
Data science
● Data analysis
● Data visualization

Software development
● Web applications
● Testing scripts

System information
● Scripting
● Deployed on most Unix systems

Education
● Teaching programming

Adopt Python - Why use it?
Object-oriented

Native indentation restriction
● Very clear, readable syntax

Powerful
● Very high-level data types
● Interoperates with other languages (e.g. C/C++)
● Python Package Index (PyPI), with packages installed through pip

Community support
● Open source
● Portable
● Free

Adopt Python - Data Structures
- List → [1, 2, 3, 4, 5, "hello"] : Ordered series of values

  - add data: list.append(1)

- Dictionary → {"key": "value", "hello": 1} : Key/value data structure

  - add data: dict['key'] = 'value'

- Tuple → (1, 2, 3, 4, "hello") : Like a list, but immutable
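
The three structures can be tried in a few lines. This is a minimal sketch; the variable names (`values`, `record`, `point`) are illustrative and do not come from the slides.

```python
# List: ordered, mutable series of values
values = [1, 2, 3, 4, 5, "hello"]
values.append(1)                    # add an element at the end
print(values)

# Dictionary: key/value data structure
record = {"key": "value", "hello": 1}
record["new_key"] = "new value"     # add (or overwrite) an entry
print(record["hello"])

# Tuple: like a list, but immutable
point = (1, 2, 3, 4, "hello")
try:
    point[0] = 99                   # item assignment is not allowed on tuples
except TypeError as error:
    print("tuples are immutable:", error)
```

Lists and dictionaries can grow after creation, while a tuple has to be rebuilt to change its content.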

Jupyter - Python libraries we use
Web Scraping: BeautifulSoup
● HTML/XML parser
● Built on top of popular Python parsers like lxml and html5lib
● Designed for quick turnaround projects

Data Visualization: Pandas
● Similar to R, MATLAB, SAS
● Built on top of NumPy; works with SciPy and Matplotlib
● Handles the vast majority of typical use cases in finance, statistics, social science, and many areas of engineering

Jupyter - Brief
Web platform for Data Science.

Support for over 40 programming languages.

IPython is an interactive shell for Python.

Create and share notebooks that contain:
● live code
● equations
● visualizations
● text

Jupyter - Notebooks
Interactive way to learn, experiment and share your work.

Tool of choice for Data scientists.

Runs in every recent web browser.

Jupyter - Notebook: input and output file formats
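
The figure for this slide is not reproduced here. As background, a notebook is saved on disk as a JSON document with the `.ipynb` extension. The sketch below builds a minimal notebook structure by hand and reads it back with the standard `json` module; the two cells and their contents are illustrative assumptions.

```python
import json

# Minimal notebook document (nbformat 4); cell contents are illustrative
notebook = {
    "nbformat": 4,
    "nbformat_minor": 4,
    "metadata": {},
    "cells": [
        {"cell_type": "markdown", "metadata": {}, "source": ["# Title"]},
        {
            "cell_type": "code",
            "execution_count": None,
            "metadata": {},
            "outputs": [],
            "source": ["print('hello')"],
        },
    ],
}

# This text is what would be stored in a .ipynb file
text = json.dumps(notebook, indent=1)

# Reading it back gives plain dictionaries and lists
loaded = json.loads(text)
for cell in loaded["cells"]:
    print(cell["cell_type"], "->", "".join(cell["source"]))
```

Because the format is plain JSON, notebooks can be inspected or converted with ordinary tools.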



Python for data science
#2 Web Scraping


Web scraping
Outline
Web page

BeautifulSoup Library

Practical work
Web Scraping - What is it?
Data Scraping?

● Automated process
● Explore and download raw data
● Grab content
● Convert data into a usable format for analysis
● Store data in a database or text file

Web Scraping = Data Scraping of web pages

Web Scraping - What is a web page ?
Components of a web page

● HTML - Organizes and contains the main content of a web page
● CSS - Adds styling to make the page look nicer
● JS - JavaScript files add interactivity to web pages
● Media files - Images, sounds, videos, etc.

Interesting content for web scraping = HTML

Web Scraping - HTML
HTML is used to create documents on the Web

Very simple and logical

NOT a programming language but a markup language that uses <tags> like this

The websites you view are basically HTML files rendered by web browsers

Web Scraping - HTML
HTML is organized like a hierarchical tree

Source: Frances Zlotnick
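
To make the tree structure concrete, the sketch below prints each tag of a small page indented by its depth. It uses only the standard library's `html.parser`, not BeautifulSoup; the page content and the `TreePrinter` class are illustrative assumptions, not material from the slides.

```python
from html.parser import HTMLParser

# Illustrative page, not taken from the slides
PAGE = """
<html>
  <body>
    <h1>Results</h1>
    <table>
      <tr><th>Party</th><th>Seats</th></tr>
      <tr><td>A</td><td>10</td></tr>
    </table>
  </body>
</html>
"""


class TreePrinter(HTMLParser):
    """Record each opening tag together with its depth in the tree."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.nodes = []  # list of (depth, tag) pairs

    def handle_starttag(self, tag, attrs):
        self.nodes.append((self.depth, tag))
        self.depth += 1  # children of this tag sit one level deeper

    def handle_endtag(self, tag):
        self.depth -= 1  # closing a tag moves back up one level


printer = TreePrinter()
printer.feed(PAGE)
for depth, tag in printer.nodes:
    print("  " * depth + tag)
```

Each level of indentation in the output corresponds to one level of nesting in the HTML source.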


Web Scraping - Inspect the source
Inspect element
Find HTML nodes
<table> defines a table
<tr> defines a row in a table
<th> defines a table header cell
<td> defines a cell in a table

Use BeautifulSoup to grab it

Web Scraping - BeautifulSoup
● A Python library

● Pulls data out of HTML/XML files

● Designed for quick turnaround projects

● Comes with some superb methods

● Open-source, free & well documented

Web Scraping - Jump into the code
# Grab nodes with BeautifulSoup (Python 3, beautifulsoup4 package)
# Import libraries
from urllib.request import urlopen
from bs4 import BeautifulSoup

# Download data
raw_html = urlopen('http://www.elections.in/delhi/mcd-elections/').read()

# Instantiate BeautifulSoup object
soup = BeautifulSoup(raw_html, 'html.parser')

# Access the data
attrs = {'class': 'tableizer-table'}
tables = soup.find_all(attrs=attrs)
table = tables[0]
rows = table.find_all('tr')

Web Scraping - Jump into the code
Use grabbed data to write a CSV file

# Import the CSV library
import csv

# Open a file with write permissions
with open('export.csv', 'w', newline='', encoding='utf-8') as f:
    # Handle it with the csv library's methods
    writer = csv.writer(f, delimiter=';')
    # Loop over the rows to select the data inside table cells
    for row in rows:
        csv_row = []  # collect the cell texts in a Python list
        headers = row.find_all('th')
        for header in headers:
            csv_row.append(header.text)
        cells = row.find_all('td')
        for cell in cells:
            csv_row.append(cell.text)
        # Write the list to the CSV file
        writer.writerow(csv_row)
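
The same extraction can be tried offline on an inline HTML string, with no network access. This is a sketch under assumptions: it requires the `beautifulsoup4` package, the table content (`Ward`, `Winner`, `Candidate A`, `Candidate B`) is made up for illustration, and the CSV text is written to an in-memory buffer rather than to a file.

```python
import csv
import io

from bs4 import BeautifulSoup

# Illustrative HTML standing in for a downloaded page
RAW_HTML = """
<table class="tableizer-table">
  <tr><th>Ward</th><th>Winner</th></tr>
  <tr><td>1</td><td>Candidate A</td></tr>
  <tr><td>2</td><td>Candidate B</td></tr>
</table>
"""

soup = BeautifulSoup(RAW_HTML, "html.parser")
table = soup.find("table", attrs={"class": "tableizer-table"})

# One list of cell texts per table row, header cells included
rows = [
    [cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])]
    for tr in table.find_all("tr")
]

# Write to an in-memory buffer instead of a file on disk
buffer = io.StringIO()
csv.writer(buffer, delimiter=";").writerows(rows)
print(buffer.getvalue())
```

Testing the parsing logic on a fixed string first makes it easier to spot when a live page changes its markup.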

Web Scraping - Jump into the code

Extraction Result



Python for data science
#3 Data Visualization


Outline
Pandas library

Visualization

Practical work
Data Visualization - Brief
Pandas - Panel Data System

Used in production in many companies, especially in the financial industry

Suitable for many different kinds of data

Two primary data structures:
● Series (1 dimensional)
● DataFrame (2 dimensional). For R users, it's like R's data.frame on steroids.

Data Visualization - Series
1 dimensional

from pandas import Series  # import pandas library

data = {'a': 0., 'b': 1., 'c': 2.}  # create a Python dictionary with data
s = Series(data)  # instantiate Series object
print(s)  # show variable content
a    0.0
b    1.0
c    2.0
dtype: float64

Data Visualization - Series
import matplotlib.pyplot as plt
s.plot()
plt.show()

Data Visualization - Series
s = s.reindex(['c','a','b'])

print(s)
c 2.0
a 0.0
b 1.0

s.plot()
plt.show()

Data Visualization - DataFrame
2 dimensional table data structure

Like data.frame in R

Data manipulation with integrated indexing

Supports columns of heterogeneous types
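
A small DataFrame can be built directly from a dictionary to see these properties. The sketch below assumes pandas is installed and uses three rows of the 2012 electoral-college data that appears elsewhere in these slides.

```python
import pandas as pd

# Three rows copied from the electoral-college table in the slides
data = pd.DataFrame(
    {
        "Name": ["Alaska", "Alabama", "California"],
        "Electors": [3, 9, 55],
        "Population": [710000, 4780000, 37254000],
    },
    index=pd.Index(["AK", "AL", "CA"], name="State"),
)

print(data.dtypes)               # a text column next to integer columns
print(data.loc["CA", "Name"])    # label-based lookup through the index
print(data[data.Electors > 5])   # boolean filtering on a column
```

The `State` index plays the role of row labels, which is what makes lookups such as `data.loc["CA", "Name"]` possible.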

Data Visualization - DataFrame
#File input/output
import pandas as pd
data = pd.read_csv('2012-electoral-college.csv', sep=';', index_col='State')
data.head()
             Name  Electors  Population
State
AK         Alaska         3      710000
AL        Alabama         9     4780000
AR       Arkansas         6     2916000
AZ        Arizona        11     6392000
CA     California        55    37254000

Data Visualization - DataFrame
#Analysis
>>> data.Electors.mean()
10.549019607843137
>>> data.Electors.max()
55
>>> data.loc[data.Electors.idxmax(), 'Name']
'California'
>>> data.Population.sum()
308746000
>>> data['ratio'] = data['Electors'] / data['Population']
>>> data
             Name  Electors  Population     ratio
State
AK         Alaska         3      710000  0.000004
AL        Alabama         9     4780000  0.000002
AR       Arkansas         6     2916000  0.000002
AZ        Arizona        11     6392000  0.000002
[...]

Data Visualization - DataFrame
#Visualization with matplotlib

import matplotlib.pyplot as plt

data.Electors.plot.bar()
plt.show()

Data Visualization - Go further
And much more
● Group By
● Merge, join, aggregation
● Reshaping and Pivot Tables
● Time based series, date functions
● Multi-index
● ...
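
Three of these features can be sketched on toy data. The example assumes pandas is installed; the `votes` and `names` tables and their column names are invented for illustration and are not part of the course material.

```python
import pandas as pd

# Hypothetical toy tables, invented for illustration
votes = pd.DataFrame({
    "state": ["AK", "AK", "AL", "AL"],
    "year": [2008, 2012, 2008, 2012],
    "electors": [3, 3, 9, 9],
})
names = pd.DataFrame({"state": ["AK", "AL"], "name": ["Alaska", "Alabama"]})

# Group By: total electors per year
per_year = votes.groupby("year")["electors"].sum()

# Merge: attach the state name to each row
merged = votes.merge(names, on="state", how="left")

# Pivot table: one row per state, one column per year
pivot = votes.pivot_table(index="state", columns="year", values="electors")

print(per_year)
print(merged)
print(pivot)
```

Group By summarizes rows, merge combines two tables on a shared key, and the pivot table reshapes long data into a wide layout.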
