DEV Lab Manual-1

Important question

Uploaded by

gowrismiley353
Copyright
© © All Rights Reserved
EX NO: 1
DATE:

INSTALLING DATA ANALYSIS AND VISUALIZATION TOOL

To write the steps to install a data analysis and visualization tool: R / Python / Tableau Public / Power BI

PROCEDURE:

R:
* R is a programming language and software environment specifically designed for statistical computing and graphics.

Windows:
* Download R from the official website: https://cran.r-project.org/mirrors.html
* Run the installer and follow the installation instructions.

macOS:
* Download R for macOS from the official website: https://cran.r-project.org/mirrors.html
* Open the downloaded file and follow the installation instructions.

Linux:
* You can typically install R using your distribution's package manager. For example, on Ubuntu, you can use the following command:

    sudo apt-get install r-base

Python:
* Python is a versatile programming language widely used for data analysis. You can install Python and data analysis libraries using a package manager like conda or pip.

Windows:
* Download Python from the official website: https://www.python.org/downloads/windows
* Run the installer, and make sure to check the "Add Python to PATH" option during installation.
* You can install data analysis libraries like NumPy, pandas, and matplotlib using pip.

macOS:
* Older macOS releases came with Python pre-installed; on recent releases you may need to install Python 3 first. You can then install additional packages using pip or set up a virtual environment using Anaconda.

Linux:
* Python is often pre-installed on Linux. Use your distribution's package manager to install Python if it is not already installed. You can also use conda or pip to manage Python packages.

Tableau Public:
* Tableau Public is a free version of Tableau for creating and sharing interactive data visualizations.
* Go to the Tableau Public website: https://public.tableau.com/s/gallery
* Download and install Tableau Public by following the instructions on the website.

Power BI:
* Power BI is a business analytics service by Microsoft for creating interactive reports and dashboards.
* Go to the Power BI website: https://powerbi.microsoft.com/en-us/downloads/
* Download and install Power BI Desktop, which is the tool for creating reports and dashboards.

Installation steps may change over time, so check the official websites for the most up-to-date instructions and download links. System requirements also vary, so make sure your computer meets the necessary specifications for these tools.

EXPLORATORY DATA ANALYSIS (EDA) ON DATASETS

Aim:
To perform exploratory data analysis (EDA) on datasets such as an email data set.

Procedure:
Exploratory Data Analysis (EDA) on email datasets involves importing the data, cleaning it, visualizing it, and extracting insights. The steps below use Python and Pandas.

Import Necessary Libraries:
Import the required Python libraries for data analysis and visualization.

Load Email Data:
Assuming you have a folder containing email files (e.g., .eml files), you can use the email library to parse and extract the email contents.

Data Cleaning:
Depending on your dataset, you may need to clean and preprocess the data. Common cleaning steps include handling missing values, converting dates to datetime format, and removing duplicates.

Data Exploration:
Now you can start exploring the dataset. Common EDA tasks are:
* Basic Statistics: Get summary statistics of the dataset.
* Distribution of Dates: Visualize the distribution of email dates.
* Word Cloud for Subject or Message: Create a word cloud to visualize common words in email subjects or messages.
* Top Senders and Recipients: Find the top email senders and recipients.

Depending on your dataset, you can explore further, for example by analyzing sentiment or performing network analysis, to gain more insights from your email data.
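The "Load Email Data" step can be sketched as follows. This is only a minimal sketch under stated assumptions, not the program used in this exercise (which reads a prepared CSV file): it assumes a folder of .eml files, and the function name load_eml_folder and the column names are hypothetical choices made for illustration.

```python
from email import policy
from email.parser import BytesParser
from pathlib import Path

import pandas as pd

# Hypothetical column names chosen for this sketch
COLUMNS = ["file", "sender", "recipient", "subject", "date", "text"]


def load_eml_folder(folder):
    """Parse every .eml file in `folder` into one DataFrame row per message."""
    rows = []
    for path in sorted(Path(folder).glob("*.eml")):
        with open(path, "rb") as fh:
            msg = BytesParser(policy=policy.default).parse(fh)

        # Prefer the plain-text part of the message, if there is one
        body = msg.get_body(preferencelist=("plain",))
        date_header = msg["Date"]

        rows.append({
            "file": path.name,
            "sender": str(msg.get("From", "")),
            "recipient": str(msg.get("To", "")),
            "subject": str(msg.get("Subject", "")),
            # With policy.default the Date header exposes a parsed datetime
            "date": date_header.datetime if date_header is not None else None,
            "text": body.get_content() if body is not None else "",
        })
    return pd.DataFrame(rows, columns=COLUMNS)
```

Calling load_eml_folder("emails") (the folder name is an assumption) returns a DataFrame on which the cleaning and exploration steps, such as df["sender"].value_counts() for the top senders, can then be applied.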
Program:

    # Import necessary libraries
    import pandas as pd
    import matplotlib.pyplot as plt
    import seaborn as sns

    # Load the dataset (raw string so the backslashes are not read as escapes)
    df = pd.read_csv(r'D:\ARCHANA\dxv\LAB\DXV\Emaildataset.csv')

    # Display basic information about the dataset
    print(df.info())

    # Display the first few rows of the dataset
    print(df.head())

    # Descriptive statistics
    print(df.describe())

    # Check for missing values
    print(df.isnull().sum())

    # Visualize the distribution of numerical variables
    sns.pairplot(df)
    plt.show()

    # Visualize the distribution of categorical variables
    sns.countplot(x='label', data=df)
    plt.show()

    # Correlation matrix for numerical variables
    # (numeric_only=True skips the text columns, which recent pandas versions require)
    correlation_matrix = df.corr(numeric_only=True)
    sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm')
    plt.show()

    # Word cloud for text data (here the 'text' column of the dataset)
    from wordcloud import WordCloud
    text_data = ' '.join(df['text'])
    wordcloud = WordCloud(width=800, height=400, random_state=21,
                          max_font_size=110).generate(text_data)
    plt.figure(figsize=(10, 7))
    plt.imshow(wordcloud, interpolation="bilinear")
    plt.axis('off')
    plt.show()

OUTPUT:

    Data columns (total 4 columns):
     #   Column      Non-Null Count  Dtype
     0   Unnamed: 0  5171 non-null   int64
     1   label       5171 non-null   object
     2   text        5171 non-null   object
     3   label_num   5171 non-null   int64
    dtypes: int64(2), object(2)
    memory usage: 161.7+ KB
    None

       Unnamed: 0 label                                               text  label_num
    0         605   ham  Subject: enron methanol ; meter # : 988291\r\n...          0
    1        2349   ham  Subject: pl nom for january 9 , 2001\r\n( see ...          0
    2        3624   ham  Subject: neon retreat\r\nho ho ho , we ' re ar...          0
    3        4685  spam  Subject: photoshop , windows , office . cheap ...          1
    4        2030   ham  Subject: re : indian springs\r\nthis deal is t...          0

            Unnamed: 0    label_num
    count  5171.000000  5171.000000
    mean   2585.000000     0.289886
    std    1492.883452     0.453753
    min       0.000000     0.000000
    25%    1292.500000     0.000000
    50%    2585.000000     0.000000
    75%    3877.500000     1.000000
    max    5170.000000     1.000000

    Unnamed: 0    0
    label         0
    text          0
    label_num     0
    dtype: int64

    [Figure: pair plot of the numerical variables (axis: Unnamed: 0)]

Result:
Thus exploratory data analysis (EDA) on an email data set has been performed successfully.

EX NO: 3

WORKING WITH NUMPY ARRAYS, PANDAS DATA FRAMES, BASIC PLOTS USING MATPLOTLIB

Write the steps for working with NumPy arrays, Pandas data frames, and basic plots using Matplotlib.

Procedure:

1. NumPy:
NumPy is a fundamental library for numerical computing in Python. It provides support for multi-dimensional arrays and various mathematical functions. To get started, install NumPy if you haven't already (you can use pip). Once NumPy is installed, you can use it as follows:

    import numpy as np

    # Creating NumPy arrays
    arr = np.array([1, 2, 3, 4, 5])
    print(arr)

    # Basic operations
    mean = np.mean(arr)
    total = np.sum(arr)   # named 'total' to avoid shadowing the built-in sum()

    # Mathematical functions
    square_root = np.sqrt(arr)
    exponential = np.exp(arr)

    # Indexing and slicing
    first_element = arr[0]
    sub_array = arr[1:4]

    # Array operations
    combined_array = np.concatenate([arr, sub_array])

OUTPUT:

    [1 2 3 4 5]
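The arrays in the NumPy program are one-dimensional. As a small illustration of the multi-dimensional support mentioned in the description, the following sketch (the variable name matrix is only illustrative) builds a 2-D array and applies the same kind of operations along rows and columns.

```python
import numpy as np

# A 2-D array (matrix) with 2 rows and 3 columns
matrix = np.array([[1, 2, 3],
                   [4, 5, 6]])

print(matrix.shape)         # (2, 3): (rows, columns)
print(matrix[1, 2])         # 6: element in row 1, column 2
print(matrix[:, 0])         # [1 4]: first column
print(matrix.sum(axis=0))   # [5 7 9]: column sums
print(matrix.mean(axis=1))  # [2. 5.]: row means
```

The axis argument selects the direction of the operation: axis=0 works down the rows (one result per column) and axis=1 works across the columns (one result per row).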
2. Pandas:
Pandas is a powerful library for data manipulation and analysis. You can install Pandas using pip:

    pip install pandas

Here's how to work with Pandas DataFrames:

    import pandas as pd

    # Creating a DataFrame from a dictionary
    data = {
        'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Emily'],
        'Age': [25, 30, 35, 28, 22],
        'City': ['New York', 'Los Angeles', 'Chicago', 'Houston', 'Miami']
    }
    df = pd.DataFrame(data)

    # Display the entire DataFrame
    print("DataFrame:")
    print(df)

    # Accessing specific columns
    print("\nAccessing 'Name' column:")
    print(df['Name'])

    # Adding a new column
    df['Salary'] = [50000, 60000, 75000, 48000, 55000]

    # Filtering data
    print("\nPeople older than 30:")
    print(df[df['Age'] > 30])

    # Sorting by a column
    print("\nSorting by 'Age' in descending order:")
    print(df.sort_values(by='Age', ascending=False))

    # Aggregating data
    print("\nAverage age:")
    print(df['Age'].mean())

    # Grouping and aggregation
    grouped_data = df.groupby('City')['Salary'].mean()
    print("\nAverage salary by city:")
    print(grouped_data)

    # Applying a function to a column
    df['Age_Squared'] = df['Age'].apply(lambda x: x ** 2)

    # Removing a column
    df = df.drop(columns=['Age_Squared'])

    # Saving the DataFrame to a CSV file
    df.to_csv('output.csv', index=False)

    # Reading a CSV file into a DataFrame
    new_df = pd.read_csv('output.csv')
    print("\nDataFrame from CSV file:")
    print(new_df)

OUTPUT:

    DataFrame:
          Name  Age         City
    0    Alice   25     New York
    1      Bob   30  Los Angeles
    2  Charlie   35      Chicago
    3    David   28      Houston
    4    Emily   22        Miami

    Accessing 'Name' column:
    0      Alice
    1        Bob
    2    Charlie
    3      David
    4      Emily
    Name: Name, dtype: object

    People older than 30:
          Name  Age     City  Salary
    2  Charlie   35  Chicago   75000

    Sorting by 'Age' in descending order:
          Name  Age         City  Salary
    2  Charlie   35      Chicago   75000
    1      Bob   30  Los Angeles   60000
    3    David   28      Houston   48000
    0    Alice   25     New York   50000
    4    Emily   22        Miami   55000

    Average age:
    28.0

    Average salary by city:
    City
    Chicago        75000.0
    Houston        48000.0
    Los Angeles    60000.0
    Miami          55000.0
    New York       50000.0
    Name: Salary, dtype: float64

    DataFrame from CSV file:
          Name  Age         City  Salary
    0    Alice   25     New York   50000
    1      Bob   30  Los Angeles   60000
    2  Charlie   35      Chicago   75000
    3    David   28      Houston   48000
    4    Emily   22        Miami   55000
3. Matplotlib:
Matplotlib is a popular library for creating static, animated, or interactive plots and graphs. Install Matplotlib using pip:

    pip install matplotlib

Here's a simple example of creating a basic plot:

    import numpy as np
    import matplotlib.pyplot as plt

    # Sample data
    x = np.linspace(0, 10, 100)
    y = np.sin(x)

    # Create a line plot
    plt.figure(figsize=(8, 6))
    plt.plot(x, y, label='Sine Wave')
    plt.title('Sine Wave Plot')
    plt.xlabel('X-axis')
    plt.ylabel('Y-axis')
    plt.legend()
    plt.grid(True)
    plt.show()

OUTPUT:

    [Figure: Sine Wave Plot]

RESULT:
Thus working with NumPy, Pandas, and Matplotlib has been completed successfully.

EX NO: 4

EXPLORING VARIOUS VARIABLE AND ROW FILTERS IN R FOR CLEANING DATA

To explore various variable and row filters in R for cleaning data.

PROCEDURE:

Data Preparation and Cleaning
First, create a sample dataset, then explore various variable and row filters to clean the data.

    # Create a sample dataset
    set.seed(123)
    data <- data.frame(
      ID = 1:10,
      Age = sample(18:60, 10, replace = TRUE),
      Gender = sample(c("Male", "Female"), 10, replace = TRUE),
      Score = sample(1:100, 10)
    )

    # Print the sample data
    print(data)

OUTPUT:

       ID Age Gender Score
    1   1  43   Male    99
    2   2  32 Female    30
    3   3  31   Male
    4   4  28   Male
    5   5  59   Male
    6   6       Male
    7   7  54 Female
    8   8  31   Male
    9   9  42   Male
    10 10  43   Male

Variable Filters

1. Filtering by a Specific Value:
To filter rows based on a specific value in a variable (e.g., only show rows where Age is greater than 30):

    filtered_data <- data[data$Age > 30, ]

2. Filtering by Multiple Conditions:
You can filter rows based on multiple conditions using the & (AND) or | (OR) operators (e.g., show rows where Age is greater than 30 and Gender is "Male"):

    filtered_data <- data[data$Age > 30 & data$Gender == "Male", ]

Row Filters

1. Removing Duplicate Rows:
To remove duplicate rows based on certain columns (e.g., ID, Age, and Gender) while keeping all columns of the data:

    cleaned_data <- data[!duplicated(data[, c("ID", "Age", "Gender")]), ]
2. Removing Rows with Missing Values:
To remove rows with missing values (NA):

    cleaned_data <- na.omit(data)

Data Visualization

1. Apply various plot features using the ggplot2 package to visualize the cleaned data.

    # Load the ggplot2 package
    library(ggplot2)

    # Create a scatterplot of Age vs. Score with points colored by Gender
    ggplot(data = cleaned_data, aes(x = Age, y = Score, color = Gender)) +
      geom_point() +
      labs(title = "Scatterplot of Age vs. Score", x = "Age", y = "Score")

    # Create a histogram of Age
    ggplot(data = cleaned_data, aes(x = Age)) +
      geom_histogram(binwidth = 5, fill = "blue", alpha = 0.5) +
      labs(title = "Histogram of Age", x = "Age", y = "Frequency")

    # Create a bar chart of Gender distribution
    # (alpha value is cut off in the source; 0.5 is assumed, as in the histogram)
    ggplot(data = cleaned_data, aes(x = Gender)) +
      geom_bar(fill = "green", alpha = 0.5) +
      labs(title = "Gender Distribution", x = "Gender", y = "Count")

RESULT:
Thus exploring various variable and row filters in R for cleaning data has been completed successfully.

EX NO: 5
DATE:

PERFORM EDA ON WINE QUALITY DATA SET

AIM:
To write a program to perform EDA on the Wine Quality data set.

PROGRAM:

    import pandas as pd
    import numpy as np
    import matplotlib.pyplot as plt
    import seaborn as sns

    # Load the dataset
    data = pd.read_csv("pathname")

    # Display the first few rows of the dataset
    print(data.head())

    # Get information about the dataset
    print(data.info())

    # Summary statistics
    print(data.describe())

    # Distribution of wine quality
    sns.countplot(x='quality', data=data)
    plt.title("Wine Quality data set")
    plt.show()

    # Box plots for selected features by wine quality
    features = ['alcohol', 'volatile acidity', 'citric acid', 'residual sugar']
    for feature in features:
        plt.figure(figsize=(8, 6))
        sns.boxplot(x='quality', y=feature, data=data)
        plt.title(f'{feature} by Wine Quality')
        plt.show()

    # Pair plot of selected features
    sns.pairplot(data, vars=['alcohol', 'volatile acidity', 'citric acid', 'residual sugar'],
                 hue='quality', diag_kind='kde')
    plt.suptitle("Pair Plot of Selected Features")
    plt.show()

    # Correlation heatmap
    corr_matrix = data.corr(numeric_only=True)
    plt.figure(figsize=(10, 8))
    # fmt value is cut off in the source; two decimal places are assumed
    sns.heatmap(corr_matrix, annot=True, cmap="coolwarm", fmt=".2f")
    plt.title("Correlation Heatmap")
    plt.show()

    # Histograms of selected features
    features = ['alcohol', 'volatile acidity', 'citric acid', 'residual sugar']
    for feature in features:
        plt.figure(figsize=(6, 4))
        sns.histplot(data[feature], kde=True, bins=20)
        plt.title(f"Distribution of {feature}")
        plt.show()

OUTPUT:

    [Figures: Distribution of Wine Quality; alcohol by Wine Quality; volatile acidity by Wine Quality; citric acid by Wine Quality; residual sugar by Wine Quality; Pair Plot of Selected Features]

RESULT:
Thus the program to perform EDA on the Wine Quality data set has been executed successfully.

DATE:

TIME SERIES ANALYSIS USING VARIOUS VISUALIZATION TECHNIQUES

AIM:
To perform time series analysis and apply various visualization techniques.

DOWNLOADING DATASET:

Step 1: Open Google, enter the following address in the address bar, and download a dataset: http://github.com/jbrownlee/Datasets

Step 2: Write the following code to load the dataset and display its details:

    from pandas import read_csv
    from matplotlib import pyplot

    series = read_csv('pathname')
    print(series.head())
    series.plot()
    pyplot.show()

OUTPUT:

    [Figure: plot of the series]

Step 3: To get the time series line plot:

    # style value is cut off in the source; 'k.' (black dots) is assumed
    series.plot(style='k.')
    pyplot.show()

    [Figure: time series line plot]

Step 4: To create a histogram:

    series.hist()
    pyplot.show()

    [Figure: histogram of the series]

Step 5: To create a density plot:

    series.plot(kind='kde')
    pyplot.show()

Result:
Thus the time series analysis has been carried out using various visualization techniques.

EX NO:
DATE:

DATA ANALYSIS AND REPRESENTATION ON A MAP

AIM:
Write a program to perform data analysis and representation on a map using various map data sets with mouse rollover effect and user interaction.

PROCEDURE:

STEP 1:
* Make sure to install the necessary libraries.

    pip install geopandas folium bokeh

PROGRAM:

    from bokeh.io import show
    from bokeh.models import ColumnDataSource, HoverTool
    from bokeh.plotting import figure
    from bokeh.layouts import column
    import pandas as pd
    import folium

    # Load your data (raw string so the backslashes are not read as escapes)
    data = pd.read_csv(r'D:\ARCHANA\dxv\LAB\DXV\geographic.csv')

    # Create a Bokeh figure
    p = figure(width=800, height=400, tools='pan,wheel_zoom,reset')

    # Create a ColumnDataSource to hold data
    source = ColumnDataSource(data)

    # Add circle markers to the figure
    p.circle(x='Longitude', y='Latitude', size=10, source=source, color='orange')

    # Create a hover tool for mouse rollover effect
    hover = HoverTool()
    hover.tooltips = [("Info", "@Info"), ("Latitude", "@Latitude"), ("Longitude", "@Longitude")]
    p.add_tools(hover)

    # Display the Bokeh plot
    layout = column(p)
    show(layout)

    # Create a map centered on the mean position of the data points
    m = folium.Map(location=[data['Latitude'].mean(), data['Longitude'].mean()],
                   zoom_start=10)

    # Add markers for your data points
    for index, row in data.iterrows():
        folium.Marker(
            location=[row['Latitude'], row['Longitude']],
            popup=row['Info'],  # Display additional info on mouse click
        ).add_to(m)

    # Save the map to an HTML file
    m.save('map.html')

OUTPUT:

    [Figure: Bokeh plot opened in the browser]

RESULT:
Data analysis and representation on a map using various map data sets with mouse rollover effect and user interaction has been completed successfully.

BUILDING CARTOGRAPHIC VISUALIZATION

Build cartographic visualization for multiple datasets involving various countries of the world; states and districts in India etc.

PROCEDURE:

STEP 1: Collect Datasets
Gather the datasets containing geographical information for countries, states, or districts.
Make sure these datasets include the necessary attributes for mapping (e.g., country/state/district names, codes, and relevant data).

STEP 2: Install Required Libraries

    pip install geopandas matplotlib

STEP 3: Load Geographic Data
Use GeoPandas to load the geographic data for countries, states, or districts. Make sure the geographic data can be matched with your datasets through common attributes.

STEP 4: Merge Datasets
Merge your datasets with the geographic data based on the common attributes. This step is crucial for linking your data to the corresponding geographic regions.

STEP 5: Create Cartographic Visualizations
Use Matplotlib to create cartographic visualizations. You can create separate plots for different datasets or overlay them on a single map.

STEP 6: Customize and Enhance
Customize your visualizations based on your needs. You can add legends, labels, titles, and other elements to make the maps easier to interpret.

STEP 7: Save and Share
Save your visualizations as image files or interactive plots if needed. You can then share these visualizations with others.

PROGRAM:

    import pandas as pd
    import geopandas as gpd
    from shapely.geometry import Point
    import matplotlib.pyplot as plt

    df = pd.DataFrame({
        'city': ['Berlin', 'Paris', 'Munich'],
        'latitude': [52.518611111111, 48.856666666667, 48.13722],
        'longitude': [13.408333333333, 2.3516666666667, 11.57556]
    })

    # Build point geometries from the longitude/latitude pairs (WGS 84 coordinates)
    gdf = gpd.GeoDataFrame(
        df.drop(['latitude', 'longitude'], axis=1),
        crs='EPSG:4326',
        geometry=[Point(xy) for xy in zip(df.longitude, df.latitude)])
    print(gdf)

    # gpd.datasets was removed in GeoPandas 1.0; on newer versions, download the
    # Natural Earth data and pass the file path to gpd.read_file() instead.
    world = gpd.read_file(gpd.datasets.get_path('naturalearth_lowres'))
    base = world.plot(color='white', edgecolor='black')
    gdf.plot(ax=base, marker='o', color='red', markersize=5)
    plt.show()

OUTPUT:

         city                   geometry
    0  Berlin  POINT (13.40833 52.51861)
    1   Paris   POINT (2.35167 48.85667)
    2  Munich  POINT (11.57556 48.13722)

    [Figure: world map with the three cities marked]

RESULT:
Thus cartographic visualization for multiple datasets involving various countries of the world has been built successfully.

VISUALIZING VARIOUS EDA TECHNIQUES AS CASE STUDY FOR IRIS DATASET

Use a case study on a data set, apply the various EDA and visualization techniques, and present an analysis report.

PROCEDURE:

Import Libraries:
Start by importing the necessary libraries and loading the dataset.

Descriptive Statistics:
Compute and display descriptive statistics.

Check for Missing Values:
Verify whether there are any missing values in the dataset.

Visualize Data Distributions:
Visualize the distribution of numerical variables.

Correlation Heatmap:
Examine the correlation between numerical variables.

Boxplots for Categorical Variables:
Use boxplots to visualize the distribution of features by species.

Violin Plots:
Combine box plots with kernel density estimation for better visualization.

Correlation between Features:
Visualize pair-wise feature correlations.

Conclusion and Summary:
Summarize key findings and insights from the analysis.

This case study provides a comprehensive analysis of the Iris dataset, including data exploration, descriptive statistics, visualization of data distributions, correlation analysis, and feature-specific visualizations.
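The Iris procedure lists its steps without a program. The following is a minimal sketch of how those steps could be carried out, not the manual's own program. It assumes scikit-learn is installed and loads its bundled copy of the Iris data; the function names load_iris_frame, summarize, and plot_eda, the shortened column names, and the choice of petal_length for the violin plot are illustrative assumptions. The plots use pandas and Matplotlib only; the seaborn functions used in the earlier exercises would work as well.

```python
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.datasets import load_iris


def load_iris_frame():
    """Return the Iris data as a DataFrame with a readable 'species' column."""
    iris = load_iris(as_frame=True)
    # Shorten names such as 'sepal length (cm)' to 'sepal_length'
    df = iris.frame.rename(columns=lambda c: c.replace(" (cm)", "").replace(" ", "_"))
    df["species"] = df.pop("target").map(dict(enumerate(iris.target_names)))
    return df


def summarize(df):
    """Descriptive statistics, missing-value counts, and feature correlations."""
    numeric = df.select_dtypes("number")
    return numeric.describe(), df.isnull().sum(), numeric.corr()


def plot_eda(df):
    """Draw the plots named in the procedure."""
    features = [c for c in df.columns if c != "species"]

    # Distribution of each numerical variable
    df[features].hist(bins=20, figsize=(8, 6))
    plt.suptitle("Feature distributions")

    # Correlation heatmap
    fig, ax = plt.subplots(figsize=(6, 5))
    im = ax.imshow(df[features].corr(), cmap="coolwarm", vmin=-1, vmax=1)
    ax.set_xticks(range(len(features)))
    ax.set_xticklabels(features, rotation=45, ha="right")
    ax.set_yticks(range(len(features)))
    ax.set_yticklabels(features)
    fig.colorbar(im, ax=ax)
    ax.set_title("Correlation heatmap")

    # Boxplots of every feature, grouped by species
    df.boxplot(column=features, by="species", figsize=(10, 8))

    # Violin plot of one feature by species
    names, groups = [], []
    for name, group in df.groupby("species"):
        names.append(name)
        groups.append(group["petal_length"].to_numpy())
    fig, ax = plt.subplots()
    ax.violinplot(groups, showmedians=True)
    ax.set_xticks(range(1, len(names) + 1))
    ax.set_xticklabels(names)
    ax.set_title("petal_length by species")

    # Pair-wise relationships between features
    pd.plotting.scatter_matrix(df[features], figsize=(8, 8), diagonal="hist")
    plt.show()


if __name__ == "__main__":
    iris_df = load_iris_frame()
    stats, missing, corr = summarize(iris_df)
    print(stats)
    print(missing)
    print(corr)
    plot_eda(iris_df)
```

The printed statistics, missing-value counts, and correlation matrix, together with the figures, provide the material for the "Conclusion and Summary" step of the report.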
