The document discusses activities undertaken during a virtual internship including learning about pandas DataFrame functions, EDA, and data visualization. It provides details on various DataFrame methods and functions learned for exploring, cleaning, and manipulating data in pandas.
Name of the Student and Roll No.: Nandini Singh / 220617005
Name of the Company: Samatrix.io

Period of the Report: Week 1st / 2nd / 3rd / 4th / 5th / 6th / 7th / 8th / 9th / 10th (marked: 3rd)

Activities undertaken during the week

Details of the activity:
- Pandas, DataFrame functions, different types of functions like describe, info, columns.
- Project: data set importing.
- Various EDA functions on the dataset.
- Exploratory Data Analysis (EDA).
- Data visualization in pandas.
- Using data in Kaggle.

Details of field trips undertaken (if any) and summary of results of such trips: As this is a virtual internship, no field trips have been taken yet.

Learning points acquired from above activities

We learned about many topics this week, including the following.

A pandas DataFrame is a two-dimensional, size-mutable, potentially heterogeneous tabular data structure with labeled axes (rows and columns). A DataFrame consists of three principal components: the data, the rows, and the columns. A DataFrame can be created by loading a dataset from existing storage, such as a SQL database, a CSV file, or an Excel file. It can also be created from lists, a dictionary, a list of dictionaries, and so on.

Topics covered:
- Creating a DataFrame using a list.
- Dealing with rows and columns.
- Indexing and selecting data.
- Working with missing data.
- Iterating over rows and columns.
- Dropping missing values using dropna().

DataFrame Methods:

FUNCTION: DESCRIPTION
index: Attribute (not a method) that returns the index (row labels) of the DataFrame.
insert(): Inserts a column into a DataFrame at a given position.
add(): Returns the element-wise addition of the DataFrame and another object (binary operator add).
sub(): Returns the element-wise subtraction of the DataFrame and another object (binary operator sub).
mul(): Returns the element-wise multiplication of the DataFrame and another object.
div(): Returns the element-wise floating-point division of the DataFrame and another object.
unique(): Series method that returns the unique values in a column.
nunique(): Returns the count of unique values (per column when called on a DataFrame).
value_counts(): Counts the number of times each unique value occurs within a Series.
columns: Attribute that returns the column labels of the DataFrame.
axes: Attribute that returns a list representing the axes of the DataFrame.
isnull(): Creates a Boolean Series that can be used to extract rows with null values.
notnull(): Creates a Boolean Series that can be used to extract rows with non-null values.
between(): Series method that returns a Boolean Series for extracting rows where a column value falls within a predefined range.
isin(): Returns a Boolean mask for extracting rows where a column value exists in a predefined collection.
dtypes: Attribute that returns a Series with the data type of each column. The result's index is the original DataFrame's columns.
astype(): Converts the data types in a Series or DataFrame.
values: Attribute that returns a NumPy representation of the DataFrame, i.e. only the values are returned and the axes labels are removed.
sort_values(): Sorts a DataFrame in ascending or descending order of the passed column.
sort_index(): Sorts a DataFrame by its index labels instead of its values. This is useful when a DataFrame has been built from two or more DataFrames and the index needs to be put back in order.
loc[]: Indexer that retrieves rows based on index label.
iloc[]: Indexer that retrieves rows based on index position.
ix[]: Indexer that retrieved rows based on either index label or index position, combining features of .loc[] and .iloc[]. It was deprecated and is no longer available from pandas 1.0 onward; use .loc[] or .iloc[] instead.
rename(): Called on a DataFrame to change the names of the index labels or columns.
columns (assignment): Assigning a new list of names to the columns attribute is an alternative way to change the column names.
drop(): Deletes rows or columns from a DataFrame.
pop(): Removes a single column from a DataFrame and returns it as a Series.
sample(): Pulls out a random sample of rows or columns from a DataFrame.
nsmallest(): Pulls out the rows with the smallest values in a column.
nlargest(): Pulls out the rows with the largest values in a column.
shape: Attribute that returns a tuple representing the dimensionality of the DataFrame.
ndim: Attribute that returns an int representing the number of axes / array dimensions: 1 for a Series, 2 for a DataFrame.
dropna(): Drops rows or columns that contain null values, with options that control which ones are dropped.
fillna(): Replaces NaN values with a value chosen by the user.
rank(): Ranks the values in a Series in order.
query(): Alternative string-based syntax for extracting a subset from a DataFrame.
copy(): Creates an independent copy of a pandas object.
duplicated(): Creates a Boolean Series marking duplicate rows, which can be used to extract them.
drop_duplicates(): Alternative way to identify duplicate rows and remove them without manual filtering.
set_index(): Sets the DataFrame index (row labels) using one or more existing columns.
reset_index(): Resets the index of a DataFrame to the default integer index, ranging from 0 to the length of the data minus 1.
where(): Checks a DataFrame against one or more conditions and returns the result accordingly. By default, entries that do not satisfy the condition are replaced with NaN.

Exploratory Data Analysis (EDA)

EDA is applied to investigate the data and summarize the key insights. It gives us a basic understanding of our data, its distribution, its null values, and much more. We can explore data either through graphs or through Python functions. There are two types of analysis: univariate and bivariate. In univariate analysis, we analyze a single attribute. In bivariate analysis, we analyze an attribute together with the target attribute. In the non-graphical approach, we use functions and attributes such as shape, describe (summary statistics), isnull, info, dtypes, and more. In the graphical approach, we use plots such as scatter, box, bar, density, and correlation plots.
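As a minimal sketch of the non-graphical approach described above, the following example applies a few of the listed functions to a small made-up table. The column names and values are invented for illustration and are not taken from the internship dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data, invented for illustration only.
df = pd.DataFrame(
    {
        "name": ["Asha", "Ravi", "Meera", "Kabir", "Ravi"],
        "department": ["Sales", "IT", "IT", None, "IT"],
        "salary": [42000.0, 55000.0, np.nan, 38000.0, 55000.0],
    }
)

# Size, column types, and summary statistics.
n_rows, n_cols = df.shape      # shape is an attribute, so no parentheses
column_types = df.dtypes       # one data type per column
summary = df.describe()        # statistics for the numeric columns by default
df.info()                      # prints column names, non-null counts, and dtypes

# Missing values: count them per column, then either drop or fill.
missing_per_column = df.isnull().sum()
rows_without_missing = df.dropna()
salary_filled = df["salary"].fillna(df["salary"].mean())

# Univariate look at a single attribute (missing values are not counted).
department_counts = df["department"].value_counts()

# Duplicate rows: flag them, then remove them.
duplicate_flags = df.duplicated()
deduplicated = df.drop_duplicates()

print(n_rows, n_cols)
print(missing_per_column)
print(department_counts)
```

Filling the missing salary with the column mean is only one possible choice; whether to drop or fill missing values depends on the dataset and the question being asked.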
Data Visualization with Pandas

Data visualization is the presentation of data in a graphical format. It helps people understand the significance of data by summarizing and presenting a large amount of data in a simple, easy-to-understand format, and it helps communicate information clearly and effectively.

Pandas DataFrame Plots

There are several plot types built in to pandas, most of them statistical plots by nature:
- df.plot.area
- df.plot.barh
- df.plot.density
- df.plot.hist
- df.plot.line
- df.plot.scatter
- df.plot.bar
- df.plot.box
- df.plot.hexbin
- df.plot.kde
- df.plot.pie

These are the different types of DataFrame plots with which one can visualize data or datasets.

Using data in Kaggle

Kaggle is the world's largest data science community, with powerful tools and resources to help us achieve our data science goals. Kaggle allows us to create our own custom datasets, share them with others, and easily import them into our notebooks. Additionally, we can add private datasets that are visible only to us. The data types that can appear in Kaggle dataset columns include integers, floats, booleans, and strings. Kaggle also supports special BigQuery datasets.

These are the learning points I have acquired from the above activities.

Plan for the next week: Project (Pandas and Data Visualization).

Any leave taken during the week: No

Any other point: No other points as of now.