Data Science Workshop
Objective: Analyze data to support decision-making using Python's pandas library and
its DataFrames.
Steps:
1. Download and install Anaconda Navigator 1.9.12. https://anaconda.softonic.com/
2. After installation, the following interface is displayed.
3. Open Jupyter Notebook and create a new Python 3 file.
PANDAS:
pandas is a Python library for data science and machine learning. It lets you
work with tabular data, that is, data organized as tables, such as an Excel
spreadsheet or a database table.
The main features of this library are:
The two main pandas structures are Series (a vector, or a single column) and
DataFrames (matrices, or tables of rows and columns).
Series are structures similar to one-dimensional arrays. They are homogeneous, meaning
all their elements must be of the same type, and their size is immutable: it cannot
be changed, although their contents can.
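The properties just described can be checked with a small experiment. This is a minimal sketch with made-up values (the names `s` and `mixed` are illustrative, not part of the workshop files): it shows that a Series has a single dtype, that mixing types falls back to the generic `object` dtype, and that an element can be overwritten without changing the length.

```python
import pandas as pd

# Hypothetical example values, not taken from the workshop data.
s = pd.Series([10, 20, 30])

# Every element shares one dtype (an integer type here).
print(s.dtype)

# The contents can be modified in place; the length stays the same.
s.iloc[1] = 25
print(s.tolist())
print(len(s))

# When the values have different types, pandas stores them with the
# generic "object" dtype, so the Series still has a single dtype.
mixed = pd.Series([1, 'a', 2.5])
print(mixed.dtype)
```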
Several properties and methods are available for inspecting the characteristics of a Series.
Series
import pandas as pd
serie = pd.Series([1, 2, 3])
serie
Out:
0    1
1    2
2    3
dtype: int64
serie.name = 'nombre'
serie
Out:
0    1
1    2
2    3
Name: nombre, dtype: int64
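Besides `name`, a Series exposes other attributes and methods for inspecting it. The following is a minimal sketch that rebuilds the same three-element Series under the illustrative variable name `s` and prints some commonly used ones; it is not an exhaustive list.

```python
import pandas as pd

# Same values as the Series above, rebuilt so the block runs on its own.
s = pd.Series([1, 2, 3], name='nombre')

print(s.size)        # number of elements
print(s.shape)       # dimensions as a tuple
print(s.index)       # row labels (a RangeIndex by default)
print(s.dtype)       # type shared by all elements
print(s.name)        # name assigned to the Series
print(s.describe())  # summary statistics (count, mean, std, min, ...)
```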
Using the Olympics file, create a few Series that demonstrate how Series are
handled.
Data Frames
df
Out:
   columna1 columna2
0         1        a
1         2        b
2         3        c
3         4        d
To show how many rows and columns the DataFrame has:
df.shape
Out: (4, 2)
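The code that builds this DataFrame is not shown in the handout. One possible way to obtain the same four-row table, assuming it was created from a dictionary of columns, is sketched here:

```python
import pandas as pd

# Assumed construction: a dictionary mapping column names to their values.
df = pd.DataFrame({
    'columna1': [1, 2, 3, 4],
    'columna2': ['a', 'b', 'c', 'd'],
})

print(df)
print(df.shape)  # tuple of (number of rows, number of columns)
```

Note that `shape` is an attribute, so it is written with a dot and without parentheses.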
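Instruction
import pandas as pd
# Assumption: df has already been loaded from the Olympics file, and names_ids holds
# each index label split into [country name, abbreviation]; neither step is shown here.
df.index = names_ids.str[0]  # the [0] element is the country name (new index)
df['ID'] = names_ids.str[1].str[:3]  # the [1] element is the abbreviation or ID (take its first 3 characters)
df = df.drop('Totals')  # remove the row labelled 'Totals'
df.head()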
Out[5]:
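The snippet above relies on a `names_ids` object that is not defined in the handout. The sketch below shows one way such an object could be produced, on a tiny hypothetical table (`toy`) whose index labels combine a country name and a code in parentheses; the split pattern is an assumption, not something the handout confirms.

```python
import pandas as pd

# Hypothetical miniature of the Olympics table: each index label holds
# a country name followed by its code in parentheses.
toy = pd.DataFrame(
    {'Gold': [0, 5]},
    index=['Afghanistan (AFG)', 'Algeria (ALG)'],
)

# Assumed step: split every label at the whitespace that precedes "(".
names_ids = toy.index.str.split(r'\s\(')

toy.index = names_ids.str[0]          # country name becomes the new index
toy['ID'] = names_ids.str[1].str[:3]  # first three characters of the code

print(toy)
```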
Instruction
import pandas as pd
df = pd.read_csv('d:/olympics.csv')
df.head()
Out[26]:
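Instruction.
print only_gold()
  File "<ipython-input-10-d1b348111bce>", line 1
    print only_gold()
                  ^
SyntaxError: invalid syntax
This error occurs because print is a function in Python 3, so the call must be written
as print(only_gold()).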
# You should write your whole answer within the function provided. The autograder will call
# this function and compare the return value against the correct solution value
def answer_zero():
    # This function returns the row for Afghanistan, which is a Series object. The assignment
    # question description will tell you the general format the autograder is expecting
    return df.iloc[0]

# You can examine what your function returns by calling it in the cell. If you have questions
# about the assignment formats, check out the discussion forums for any FAQs
answer_zero()
Out[11]:
# Summer          13
Gold               0
Silver             0
Bronze             2
Total              2
# Winter           0
Gold.1             0
Silver.1           0
Bronze.1           0
Total.1            0
# Games           13
Gold.2             0
Silver.2           0
Bronze.2           2
Combined total     2
ID               AFG
Name: Afghanistan, dtype: object
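Instructions:
def answer_one():
    # Note: in recent pandas versions Series.argmax() returns the integer position of the
    # maximum; use idxmax() instead to get the index label (the country name).
    return df['Gold'].argmax()

answer_one()
Out[12]:
135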
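The result is a number rather than a country name because of how `argmax()` behaves. The difference between `argmax()` and `idxmax()` can be seen on a small hypothetical Series (the labels and counts below are invented for illustration and assume a recent pandas version):

```python
import pandas as pd

# Hypothetical gold-medal counts indexed by made-up country labels.
gold = pd.Series([3, 9, 4], index=['A', 'B', 'C'])

position = gold.argmax()  # integer position of the largest value
label = gold.idxmax()     # index label of the largest value

print(position)
print(label)
print(gold.index[position])  # converting the position back to a label
```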
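Instructions:
def answer_two():
    # Difference between summer ('Gold') and winter ('Gold.1') gold medals.
    # As in answer_one, argmax() gives a position; idxmax() would give the country name.
    return (df['Gold'] - df['Gold.1']).argmax()

answer_two()
Out[13]:
135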
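Instructions:
def answer_three():
    # Note: "> 1" keeps only countries with at least two golds in each season; use "> 0"
    # if countries with exactly one gold should also be included.
    atleast_one_gold = df[(df['Gold'] > 1) & (df['Gold.1'] > 1)]
    return ((atleast_one_gold['Gold'] - atleast_one_gold['Gold.1']) / atleast_one_gold['Gold.2']).idxmax()

answer_three()
Out[20]:
'Australia'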
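The filtering step combines two boolean conditions with `&`, and the threshold chosen changes which rows survive. A minimal sketch on invented numbers (the frame `toy` and its labels are hypothetical) compares a `> 0` filter with a `> 1` filter and then computes the relative difference:

```python
import pandas as pd

# Hypothetical counts: 'Gold' = summer, 'Gold.1' = winter, 'Gold.2' = combined.
toy = pd.DataFrame(
    {'Gold': [10, 1, 0], 'Gold.1': [2, 1, 5], 'Gold.2': [12, 2, 5]},
    index=['A', 'B', 'C'],
)

# Each condition needs its own parentheses when combined with &.
both = toy[(toy['Gold'] > 0) & (toy['Gold.1'] > 0)]    # at least one gold in each
strict = toy[(toy['Gold'] > 1) & (toy['Gold.1'] > 1)]  # at least two golds in each

# Relative difference between summer and winter golds, then the label of the maximum.
ratio = (both['Gold'] - both['Gold.1']) / both['Gold.2']
winner = ratio.idxmax()

print(list(both.index))
print(list(strict.index))
print(winner)
```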
Instructions:
def answer_four():
    # Weighted score: 3 points per gold, 2 per silver, 1 per bronze (combined totals)
    df['Points'] = df['Gold.2']*3 + df['Silver.2']*2 + df['Bronze.2']*1
    return df['Points']

answer_four()
Out[22]:
Afghanistan                           2
Algeria                              27
Argentina                           130
Armenia                              16
Australasia                          22
                                    ...
Yugoslavia                          171
Independent Olympic Participants      4
Zambia                                3
Zimbabwe                             18
Mixed team                           38
Name: Points, Length: 146, dtype: int64
Instructions:
census_df = pd.read_csv('d:/census.csv', encoding='cp1252')
census_df.head()
Out[45]:
Instructions
def answer_five():
    # Keep only the rows with SUMLEV == 50, then count the rows per state
    df = census_df[census_df['SUMLEV'] == 50]
    df = df.groupby(['SUMLEV', 'STNAME']).size().to_frame(name='count').reset_index()
    return df.loc[df['count'].idxmax()]['STNAME']

answer_five()
Out[46]:
'Texas'
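The counting pattern used in this function can be tried on a few invented rows. This sketch uses a hypothetical frame `toy` with the same column names and assumes, as the function above does, that the rows of interest are those with `SUMLEV == 50`; it also shows `value_counts()` as an equivalent shortcut.

```python
import pandas as pd

# Hypothetical rows that mimic the census layout.
toy = pd.DataFrame({
    'SUMLEV': [40, 50, 50, 40, 50],
    'STNAME': ['X', 'X', 'X', 'Y', 'Y'],
})

# Keep the rows of interest, then count how many remain per state.
selected = toy[toy['SUMLEV'] == 50]
counts = selected.groupby('STNAME').size()
most = counts.idxmax()

# value_counts() produces the same counts in a single call.
most_alt = selected['STNAME'].value_counts().idxmax()

print(counts)
print(most)
```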
Instructions
def answer_six():
    result = census_df.copy()
    result = result.reset_index()
    result = result[result['SUMLEV'] == 50]
    columns_to_keep = ['STNAME', 'CENSUS2010POP']
    result = result[columns_to_keep]
    # Sort so that the largest populations come first within each state
    result = result.sort_values(['STNAME', 'CENSUS2010POP'], ascending=False)
    # Keep the three most populous rows per state and add them up
    result = result.groupby('STNAME').head(3)
    result = result.groupby('STNAME').sum().sort_values('CENSUS2010POP', ascending=False).reset_index()
    result = list(result['STNAME'].loc[:2])
    return result

answer_six()
Out[49]:
['California', 'Texas', 'Illinois']
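The "top three per group, then sum" idea can be followed step by step on a small invented table. This is only a sketch with hypothetical states and populations (`toy`, `top3`, `totals`, and `ranking` are illustrative names):

```python
import pandas as pd

# Hypothetical populations for two made-up states.
toy = pd.DataFrame({
    'STNAME': ['X', 'X', 'X', 'X', 'Y', 'Y'],
    'CENSUS2010POP': [5, 40, 30, 20, 60, 10],
})

# Sort by population, then keep at most three rows for each state.
top3 = (toy.sort_values('CENSUS2010POP', ascending=False)
           .groupby('STNAME')
           .head(3))

# Add up the kept rows per state and rank the states by that total.
totals = (top3.groupby('STNAME')['CENSUS2010POP']
              .sum()
              .sort_values(ascending=False))
ranking = totals.index.tolist()

print(totals)
print(ranking)
```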
Instructions
def answer_seven():
    result = census_df.copy()
    result = result.reset_index()
    result = result[result['SUMLEV'] == 50]
    cols_to_use = ['POPESTIMATE2010',
                   'POPESTIMATE2011',
                   'POPESTIMATE2012',
                   'POPESTIMATE2013',
                   'POPESTIMATE2014',
                   'POPESTIMATE2015']
    # Row-wise minimum and maximum across the six yearly estimates
    result['MinPop'] = result.loc[:, cols_to_use].min(axis=1)
    result['MaxPop'] = result.loc[:, cols_to_use].max(axis=1)
    result['PopDelta'] = result['MaxPop'] - result['MinPop']
    columns_to_keep = ['CTYNAME', 'PopDelta']
    result = result[columns_to_keep].sort_values('PopDelta', ascending=False).reset_index()
    result = result['CTYNAME'].loc[0]
    return result

answer_seven()
Out[51]:
'Harris County'
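The key step here is taking the minimum and maximum across columns for each row (`axis=1`). A minimal sketch with invented county names and estimates shows the same computation in a compact form:

```python
import pandas as pd

# Hypothetical yearly estimates for two made-up counties.
toy = pd.DataFrame({
    'CTYNAME': ['County A', 'County B'],
    'POPESTIMATE2010': [100, 500],
    'POPESTIMATE2011': [120, 450],
    'POPESTIMATE2012': [110, 700],
})
cols = ['POPESTIMATE2010', 'POPESTIMATE2011', 'POPESTIMATE2012']

# axis=1 computes the statistic across the columns of each row.
delta = toy[cols].max(axis=1) - toy[cols].min(axis=1)

# idxmax() gives the row label of the largest change; loc reads its name.
largest = toy.loc[delta.idxmax(), 'CTYNAME']

print(delta.tolist())
print(largest)
```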
Instructions
def answer_eight():
    result = census_df.copy()
    result = result.reset_index()
    result = result[result['SUMLEV'] == 50]
    result = result[(result['REGION'] <= 2)]
    result = result[(result['POPESTIMATE2015'] > result['POPESTIMATE2014'])]
    #result['CTYNAME'] = result['CTYNAME'].str[0:10]
    # Keep the rows whose county name begins with 'Washington'
    result = result[(result['CTYNAME'].str[0:10] == 'Washington')].sort_index()
    cols_to_keep = ['STNAME', 'CTYNAME']
    result = result[cols_to_keep]
    return result

answer_eight()
Out[53]:
STNAME CTYNAME
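The successive filters in this function can also be written as a single combined mask. The sketch below uses invented rows (`toy`) and `str.startswith`, which gives the same result as comparing the first ten characters with 'Washington'; it is an alternative formulation, not the handout's own code.

```python
import pandas as pd

# Hypothetical rows with the same column names as the census file.
toy = pd.DataFrame({
    'STNAME': ['X', 'Y', 'Z'],
    'CTYNAME': ['Washington County', 'Washburn County', 'Lake Washington'],
    'REGION': [1, 2, 3],
})

# Combine the conditions into one boolean mask.
mask = (toy['REGION'] <= 2) & toy['CTYNAME'].str.startswith('Washington')
result = toy.loc[mask, ['STNAME', 'CTYNAME']]

# The slice comparison used in the function selects the same names.
by_slice = toy['CTYNAME'].str[0:10] == 'Washington'

print(result)
```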