[ Interactive Data Analysis with Jupyter Notebooks ] ( CheatSheet )

1. Jupyter Notebook Basics

● Start Jupyter Notebook: jupyter notebook


● Create new notebook: Click "New" > "Python 3"
● Run cell: Shift + Enter
● Insert cell above: A
● Insert cell below: B
● Delete cell: D, D (press twice)
● Change cell type to Markdown: M
● Change cell type to Code: Y
● Toggle line numbers: L
● Toggle output: O
● Clear cell output: Clear > Clear Cell Output
● Restart kernel: 0, 0 (press twice)
● Save notebook: Ctrl + S
● Convert to Python script: jupyter nbconvert --to script notebook.ipynb
● Convert to HTML: jupyter nbconvert --to html notebook.ipynb

2. Magic Commands

● List all magic commands: %lsmagic


● Run Python file: %run script.py
● Time cell execution: %%time
● Time multiple executions: %timeit function()
● Display plots inline: %matplotlib inline
● Display plots in a separate window: %matplotlib qt
● Load extension: %load_ext autoreload
● Autoreload modules: %autoreload 2
● Display all variables: %who
● Display all variables of a specific type: %who_ls str
● Delete variable: %reset_selective variable_name
● Run shell command: !ls -l
● Set environment variable: %env MY_VAR=value
● Debug with pdb: %pdb
● Profile code: %prun function()
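
As a minimal sketch, the three notebook cells below combine a few of these magics; the NumPy computation and the ls call are only placeholders for real work.

# cell 1: re-import edited local modules automatically
%load_ext autoreload
%autoreload 2

%%time
# cell 2: %%time (the line above) must be the very first line of its cell
import numpy as np
total = np.random.rand(1_000_000).sum()

# cell 3: list variables in the namespace, then run a shell command
%who
!ls -l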

3. Data Import and Export

● Import pandas: import pandas as pd


● Read CSV: df = pd.read_csv('file.csv')
● Read CSV with specific encoding: df = pd.read_csv('file.csv',
encoding='utf-8')
● Read CSV with custom delimiter: df = pd.read_csv('file.csv', sep='\t')
● Read Excel: df = pd.read_excel('file.xlsx', sheet_name='Sheet1')
● Read JSON: df = pd.read_json('file.json')
● Read SQL query: df = pd.read_sql_query("SELECT * FROM table", connection)
● Read from URL: df = pd.read_csv('https://example.com/data.csv')
● Read from clipboard: df = pd.read_clipboard()
● Read multiple CSV files: df = pd.concat([pd.read_csv(f) for f in
glob.glob('*.csv')])
● Write to CSV: df.to_csv('output.csv', index=False)
● Write to Excel: df.to_excel('output.xlsx', index=False)
● Write to JSON: df.to_json('output.json')
● Write to SQL: df.to_sql('table_name', connection, if_exists='replace')
● Write to clipboard: df.to_clipboard()
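
A minimal round trip using a hypothetical sales.csv with a date column (the file and column names are assumptions; writing Excel output requires the openpyxl package):

import pandas as pd

# read a CSV, parsing one column as dates
df = pd.read_csv('sales.csv', parse_dates=['date'])

# write the same data back out in two formats
df.to_csv('sales_clean.csv', index=False)
df.to_excel('sales_clean.xlsx', index=False)   # needs openpyxl installed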

4. Data Exploration

● Display first rows: df.head()


● Display last rows: df.tail()
● Display random sample: df.sample(n=5)
● Get dataframe info: df.info()
● Get dataframe statistics: df.describe()
● Get column names: df.columns
● Get data types: df.dtypes
● Get dimensions: df.shape
● Check for null values: df.isnull().sum()
● Get unique values: df['column'].unique()
● Get value counts: df['column'].value_counts()
● Get correlation matrix: df.corr()
● Get covariance matrix: df.cov()
● Display all rows: pd.set_option('display.max_rows', None)
● Display all columns: pd.set_option('display.max_columns', None)
● Reset display options: pd.reset_option('display')
● Get memory usage: df.memory_usage(deep=True)

● Get column data types and non-null counts: df.info(verbose=True, show_counts=True)
● Get basic information about RangeIndex: df.index
● Get summary of a specific column: df['column'].describe()
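
A typical first-look sequence on a freshly loaded dataframe might look like this sketch (the file and column names are placeholders):

import pandas as pd

df = pd.read_csv('sales.csv')

df.head()                              # first five rows
df.info()                              # dtypes and non-null counts
df.describe()                          # summary statistics for numeric columns
df.isnull().sum()                      # missing values per column
df['category'].value_counts()          # frequency of each category
df.select_dtypes('number').corr()      # correlation matrix, numeric columns only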

5. Data Cleaning

● Drop null values: df.dropna()


● Drop null values in specific columns: df.dropna(subset=['column1',
'column2'])
● Fill null values with a specific value: df.fillna(value)
● Fill null values with column mean: df.fillna(df.mean())
● Fill null values with column median: df.fillna(df.median())
● Fill null values with forward fill: df.ffill()
● Fill null values with backward fill: df.bfill()
● Replace values: df.replace(old_value, new_value)
● Replace values using dictionary: df.replace({'old1': 'new1', 'old2':
'new2'})
● Remove duplicates: df.drop_duplicates()
● Remove duplicates based on specific columns:
df.drop_duplicates(subset=['column1', 'column2'])
● Rename columns: df.rename(columns={'old_name': 'new_name'})
● Change data type: df['column'] = df['column'].astype('int64')
● Convert to datetime: df['date'] = pd.to_datetime(df['date'])
● Handle outliers using IQR: df = df[(df['column'] >
df['column'].quantile(0.25) - 1.5 * (df['column'].quantile(0.75) -
df['column'].quantile(0.25))) & (df['column'] <
df['column'].quantile(0.75) + 1.5 * (df['column'].quantile(0.75) -
df['column'].quantile(0.25)))]
● Strip whitespace from string columns: df = df.apply(lambda x:
x.str.strip() if x.dtype == "object" else x)
● Replace inf and -inf with NaN: df = df.replace([np.inf, -np.inf], np.nan)
● Coerce errors to NaN when changing data types: df['column'] =
pd.to_numeric(df['column'], errors='coerce')
● Drop columns: df = df.drop(['column1', 'column2'], axis=1)
● Reset index: df = df.reset_index(drop=True)
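
Put together, a condensed cleaning pass could look like the sketch below (column names are placeholders; the IQR filter is the same rule as above, just written with intermediate variables):

import pandas as pd

df = pd.read_csv('sales.csv')

df = df.drop_duplicates()                                  # drop exact duplicate rows
df['price'] = pd.to_numeric(df['price'], errors='coerce')  # bad values become NaN
df['price'] = df['price'].fillna(df['price'].median())     # fill missing prices

# IQR-based outlier filter on one column
q1, q3 = df['price'].quantile([0.25, 0.75])
iqr = q3 - q1
df = df[(df['price'] >= q1 - 1.5 * iqr) & (df['price'] <= q3 + 1.5 * iqr)]

df = df.reset_index(drop=True)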

6. Data Manipulation

● Select column: df['column']

● Select multiple columns: df[['column1', 'column2']]
● Filter rows: df[df['column'] > value]
● Filter rows with multiple conditions: df[(df['column1'] > value1) &
(df['column2'] < value2)]
● Sort values: df.sort_values('column', ascending=False)
● Sort values by multiple columns: df.sort_values(['column1', 'column2'],
ascending=[True, False])
● Group by: df.groupby('column').agg({'column2': 'mean', 'column3': 'sum'})
● Pivot table: pd.pivot_table(df, values='value', index='index',
columns='columns', aggfunc='mean')
● Melt dataframe: pd.melt(df, id_vars=['id'], value_vars=['column1',
'column2'])
● Merge dataframes: pd.merge(df1, df2, on='key', how='inner')
● Concatenate dataframes: pd.concat([df1, df2], axis=0)
● Apply function to column: df['new_column'] = df['column'].apply(lambda x:
x*2)
● Apply function to multiple columns: df[['col1', 'col2']] = df[['col1',
'col2']].apply(lambda x: x*2)
● Create new column based on conditions: df['new_column'] =
np.where(df['column'] > value, 'High', 'Low')
● Rank values: df['rank'] = df['column'].rank(method='dense',
ascending=False)
● Calculate cumulative sum: df['cumsum'] = df['column'].cumsum()
● Calculate percent change: df['pct_change'] = df['column'].pct_change()
● Shift values: df['previous'] = df['column'].shift(1)
● Get dummies (one-hot encoding): pd.get_dummies(df,
columns=['categorical_column'])
● Bin continuous variable: pd.cut(df['column'], bins=[0, 25, 50, 75, 100],
labels=['Low', 'Medium', 'High', 'Very High'])
● Reshape dataframe: df.pivot(index='date', columns='category',
values='value')
● Explode lists in a column: df = df.explode('list_column')
● Aggregate by time period: df.resample('M', on='date_column').mean()
● Rolling calculations: df['rolling_mean'] =
df['column'].rolling(window=7).mean()
● Expanding calculations: df['expanding_sum'] =
df['column'].expanding().sum()
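
A short sketch chaining several of these steps, assuming placeholder columns region, revenue, and date in a hypothetical sales.csv:

import numpy as np
import pandas as pd

df = pd.read_csv('sales.csv', parse_dates=['date'])

# filter, derive a column, then aggregate by group
high = df[df['revenue'] > 1000].copy()
high['tier'] = np.where(high['revenue'] > 5000, 'High', 'Medium')
summary = (high.groupby(['region', 'tier'])
               .agg(total_revenue=('revenue', 'sum'),
                    avg_revenue=('revenue', 'mean'))
               .reset_index()
               .sort_values('total_revenue', ascending=False))

# monthly revenue totals from the raw data
monthly = df.resample('M', on='date')['revenue'].sum()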

7. Data Visualization with Matplotlib

● Import matplotlib: import matplotlib.pyplot as plt


● Create line plot: plt.plot(x, y)
● Create scatter plot: plt.scatter(x, y)
● Create bar plot: plt.bar(x, height)
● Create horizontal bar plot: plt.barh(y, width)
● Create histogram: plt.hist(data, bins=10)
● Create box plot: plt.boxplot(data)
● Create violin plot: plt.violinplot(data)
● Create pie chart: plt.pie(sizes, labels=labels, autopct='%1.1f%%')
● Create heatmap: plt.imshow(data, cmap='hot'); plt.colorbar()
● Create subplot: fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 5))
● Set title: plt.title('Title')
● Set x-label: plt.xlabel('X-axis')
● Set y-label: plt.ylabel('Y-axis')
● Add legend: plt.legend()
● Set axis limits: plt.xlim(0, 10); plt.ylim(0, 100)
● Add text to plot: plt.text(x, y, 'Text')
● Add annotation: plt.annotate('Annotation', xy=(x, y), xytext=(x+1, y+1),
arrowprops=dict(facecolor='black', shrink=0.05))
● Customize tick labels: plt.xticks(rotation=45, ha='right')
● Add grid: plt.grid(True)
● Set figure size: plt.figure(figsize=(10, 6))
● Save figure: plt.savefig('figure.png', dpi=300, bbox_inches='tight')
● Clear current figure: plt.clf()
● Close all figures: plt.close('all')
● Create 3D plot: from mpl_toolkits.mplot3d import Axes3D; fig =
plt.figure(); ax = fig.add_subplot(111, projection='3d'); ax.scatter(xs,
ys, zs)
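
A small self-contained figure combining several of these calls on synthetic data:

import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0, 10, 100)
y = np.sin(x)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))

ax1.plot(x, y, label='sin(x)')             # line plot on the left axes
ax1.set_title('Line plot')
ax1.set_xlabel('x')
ax1.set_ylabel('y')
ax1.legend()
ax1.grid(True)

ax2.hist(np.random.randn(1000), bins=30)   # histogram on the right axes
ax2.set_title('Histogram')

plt.savefig('figure.png', dpi=300, bbox_inches='tight')
plt.show()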

8. Data Visualization with Seaborn

● Import seaborn: import seaborn as sns


● Set seaborn style: sns.set_style('darkgrid')
● Create scatter plot: sns.scatterplot(x='x', y='y', data=df)
● Create line plot: sns.lineplot(x='x', y='y', data=df)
● Create bar plot: sns.barplot(x='x', y='y', data=df)
● Create box plot: sns.boxplot(x='x', y='y', data=df)

● Create violin plot: sns.violinplot(x='x', y='y', data=df)
● Create swarm plot: sns.swarmplot(x='x', y='y', data=df)
● Create count plot: sns.countplot(x='category', data=df)
● Create heatmap: sns.heatmap(df.corr(), annot=True, cmap='coolwarm')
● Create pair plot: sns.pairplot(df)
● Create joint plot: sns.jointplot(x='x', y='y', data=df, kind='scatter')
● Create distribution plot: sns.histplot(df['column'], kde=True) (sns.distplot is deprecated)
● Create cluster map: sns.clustermap(df.corr())
● Create categorical plot: sns.catplot(x='x', y='y', hue='category',
data=df, kind='bar')
● Create regression plot: sns.regplot(x='x', y='y', data=df)
● Create residual plot: sns.residplot(x='x', y='y', data=df)
● Create facet grid: g = sns.FacetGrid(df, col='category');
g.map(plt.scatter, 'x', 'y')
● Set color palette: sns.set_palette('Set2')
● Customize plot appearance: sns.set_context('paper', font_scale=1.5,
rc={'lines.linewidth': 2.5})
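
A quick sketch using seaborn's bundled tips example dataset (fetched on first use by sns.load_dataset), so it runs without any local files:

import matplotlib.pyplot as plt
import seaborn as sns

sns.set_style('darkgrid')
tips = sns.load_dataset('tips')   # small example dataset shipped with seaborn

sns.scatterplot(x='total_bill', y='tip', hue='time', data=tips)
plt.title('Tip vs. total bill')
plt.show()

# correlation heatmap restricted to the numeric columns
sns.heatmap(tips[['total_bill', 'tip', 'size']].corr(), annot=True, cmap='coolwarm')
plt.show()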

9. Statistical Analysis

● Import NumPy and SciPy stats: import numpy as np; from scipy import stats


● Calculate mean: np.mean(data)
● Calculate median: np.median(data)
● Calculate mode: stats.mode(data)
● Calculate standard deviation: np.std(data)
● Calculate variance: np.var(data)
● Calculate skewness: stats.skew(data)
● Calculate kurtosis: stats.kurtosis(data)
● Calculate correlation: df['column1'].corr(df['column2'])
● Calculate Spearman correlation: df['column1'].corr(df['column2'],
method='spearman')
● Calculate covariance: df['column1'].cov(df['column2'])
● Perform t-test: stats.ttest_ind(group1, group2)
● Perform paired t-test: stats.ttest_rel(group1, group2)
● Perform one-way ANOVA: stats.f_oneway(group1, group2, group3)
● Perform chi-square test: stats.chi2_contingency(observed)
● Calculate p-value: stats.norm.sf(abs(z_score)) * 2
● Calculate confidence interval: stats.t.interval(confidence=0.95, df=len(data)-1, loc=np.mean(data), scale=stats.sem(data)) (the keyword was alpha in older SciPy)

● Perform Shapiro-Wilk test for normality: stats.shapiro(data)
● Perform Kolmogorov-Smirnov test: stats.kstest(data, 'norm')
● Perform Mann-Whitney U test: stats.mannwhitneyu(group1, group2)
● Perform Wilcoxon signed-rank test: stats.wilcoxon(group1, group2)
● Perform Kruskal-Wallis H-test: stats.kruskal(group1, group2, group3)
● Perform Friedman test: stats.friedmanchisquare(group1, group2, group3)
● Calculate effect size (Cohen's d): cohens_d = (np.mean(group1) -
np.mean(group2)) / np.sqrt((np.std(group1) ** 2 + np.std(group2) ** 2) /
2)
● Perform linear regression: slope, intercept, r_value, p_value, std_err =
stats.linregress(x, y)
● Calculate Pearson correlation matrix: df.corr(method='pearson')
● Calculate Kendall's Tau: stats.kendalltau(x, y)
● Perform one-sample t-test: stats.ttest_1samp(data, popmean)
● Perform Levene's test for equality of variances: stats.levene(group1,
group2)
● Perform Bartlett's test for equality of variances:
stats.bartlett(group1, group2)
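
A small sketch comparing two synthetic groups with a few of the tests above (the group means and sizes are arbitrary):

import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
group1 = rng.normal(loc=10.0, scale=2.0, size=100)
group2 = rng.normal(loc=10.5, scale=2.0, size=100)

print(stats.shapiro(group1))             # normality check
print(stats.levene(group1, group2))      # equality of variances
print(stats.ttest_ind(group1, group2))   # independent two-sample t-test

# effect size (Cohen's d), as defined above
cohens_d = (np.mean(group1) - np.mean(group2)) / np.sqrt(
    (np.std(group1) ** 2 + np.std(group2) ** 2) / 2)
print(cohens_d)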

10. Machine Learning with Scikit-learn

● Import scikit-learn (import the specific helpers you need, e.g.): from sklearn.model_selection import train_test_split


● Split data: X_train, X_test, y_train, y_test = train_test_split(X, y,
test_size=0.2, random_state=42)
● Scale features: X_scaled = StandardScaler().fit_transform(X)
● Normalize features: X_normalized = Normalizer().fit_transform(X)
● Encode categorical variables: X_encoded =
OneHotEncoder().fit_transform(X)
● Select features: selector = SelectKBest(f_classif, k=5).fit(X, y)
● Perform PCA: pca = PCA(n_components=2).fit_transform(X)
● Train linear regression: model = LinearRegression().fit(X_train, y_train)
● Train logistic regression: model = LogisticRegression().fit(X_train,
y_train)
● Train decision tree: model = DecisionTreeClassifier().fit(X_train,
y_train)
● Train random forest: model = RandomForestClassifier().fit(X_train,
y_train)
● Train SVM: model = SVC().fit(X_train, y_train)
● Train k-nearest neighbors: model = KNeighborsClassifier().fit(X_train,
y_train)

● Train naive Bayes: model = GaussianNB().fit(X_train, y_train)
● Train gradient boosting: model =
GradientBoostingClassifier().fit(X_train, y_train)
● Make predictions: y_pred = model.predict(X_test)
● Calculate accuracy: accuracy_score(y_test, y_pred)
● Calculate precision, recall, f1-score:
precision_recall_fscore_support(y_test, y_pred, average='weighted')
● Create confusion matrix: confusion_matrix(y_test, y_pred)
● Perform cross-validation: cross_val_score(model, X, y, cv=5)
● Perform grid search: GridSearchCV(model, param_grid, cv=5).fit(X, y)
● Plot ROC curve: fpr, tpr, _ = roc_curve(y_test, y_pred_proba);
plt.plot(fpr, tpr)
● Calculate AUC: roc_auc_score(y_test, y_pred_proba)
● Plot learning curve: learning_curve(model, X, y, cv=5)
● Plot validation curve: validation_curve(model, X, y, param_name=param_name, param_range=param_range, cv=5) (param_name and param_range are keyword-only in recent scikit-learn)
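
An end-to-end classification sketch on scikit-learn's bundled iris data, spelling out the imports that the shorthand above assumes:

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# fit the scaler on the training split only, then apply it to both splits
scaler = StandardScaler().fit(X_train)
X_train_s, X_test_s = scaler.transform(X_train), scaler.transform(X_test)

model = RandomForestClassifier(random_state=42).fit(X_train_s, y_train)
y_pred = model.predict(X_test_s)

print(accuracy_score(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))
print(cross_val_score(model, X, y, cv=5).mean())   # 5-fold cross-validation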

11. Deep Learning with TensorFlow and Keras

● Import TensorFlow and Keras: import tensorflow as tf; from tensorflow import keras

● Create sequential model: model = keras.Sequential()
● Add dense layer: model.add(keras.layers.Dense(64, activation='relu',
input_shape=(input_dim,)))
● Add dropout layer: model.add(keras.layers.Dropout(0.5))
● Add convolutional layer: model.add(keras.layers.Conv2D(32, (3, 3),
activation='relu'))
● Add max pooling layer: model.add(keras.layers.MaxPooling2D((2, 2)))
● Add LSTM layer: model.add(keras.layers.LSTM(64))
● Compile model: model.compile(optimizer='adam',
loss='binary_crossentropy', metrics=['accuracy'])
● Train model: history = model.fit(X_train, y_train, epochs=10,
batch_size=32, validation_split=0.2)
● Evaluate model: model.evaluate(X_test, y_test)
● Make predictions: y_pred = model.predict(X_test)
● Save model: model.save('model.h5')
● Load model: loaded_model = keras.models.load_model('model.h5')
● Plot training history: plt.plot(history.history['accuracy']); plt.plot(history.history['val_accuracy'])

● Use early stopping: early_stopping =
keras.callbacks.EarlyStopping(patience=3)
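
A minimal binary-classification sketch on synthetic data (the 20-feature input and the layer sizes are arbitrary; requires TensorFlow installed):

import numpy as np
import matplotlib.pyplot as plt
from tensorflow import keras

# synthetic data: 1000 samples, 20 features, binary labels
X = np.random.rand(1000, 20).astype('float32')
y = (X.sum(axis=1) > 10).astype('float32')

model = keras.Sequential([
    keras.Input(shape=(20,)),
    keras.layers.Dense(64, activation='relu'),
    keras.layers.Dropout(0.5),
    keras.layers.Dense(1, activation='sigmoid'),
])
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

early_stopping = keras.callbacks.EarlyStopping(patience=3, restore_best_weights=True)
history = model.fit(X, y, epochs=10, batch_size=32,
                    validation_split=0.2, callbacks=[early_stopping])

# plot training and validation accuracy as two separate lines
plt.plot(history.history['accuracy'], label='train')
plt.plot(history.history['val_accuracy'], label='validation')
plt.legend()
plt.show()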

12. Natural Language Processing

● Import NLTK: import nltk


● Download NLTK data: nltk.download('punkt')
● Tokenize text: tokens = nltk.word_tokenize(text)
● Get sentences: sentences = nltk.sent_tokenize(text)
● Remove stopwords: from nltk.corpus import stopwords; tokens = [word for
word in tokens if word.lower() not in stopwords.words('english')]
● Perform stemming: from nltk.stem import PorterStemmer; stemmer =
PorterStemmer(); stems = [stemmer.stem(word) for word in tokens]
● Perform lemmatization: from nltk.stem import WordNetLemmatizer;
lemmatizer = WordNetLemmatizer(); lemmas = [lemmatizer.lemmatize(word)
for word in tokens]
● Perform part-of-speech tagging: pos_tags = nltk.pos_tag(tokens)
● Extract named entities: named_entities = nltk.ne_chunk(pos_tags)
● Calculate term frequency: from nltk.probability import FreqDist;
freq_dist = FreqDist(tokens)
● Calculate TF-IDF: from sklearn.feature_extraction.text import
TfidfVectorizer; tfidf = TfidfVectorizer().fit_transform(documents)
● Perform topic modeling: from gensim import corpora, models; lda_model =
models.LdaMulticore(corpus, num_topics=10)
● Train Word2Vec model: from gensim.models import Word2Vec; model =
Word2Vec(sentences, vector_size=100, window=5, min_count=1, workers=4)
● Perform sentiment analysis: from textblob import TextBlob; sentiment =
TextBlob(text).sentiment
● Perform text classification: from sklearn.naive_bayes import
MultinomialNB; clf = MultinomialNB().fit(X_train, y_train)
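
A short tokenize-and-normalize sketch (the sample sentence is arbitrary; the NLTK corpora must be downloaded once):

import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

# one-time downloads (newer NLTK releases may also need 'punkt_tab')
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

text = "Jupyter notebooks make interactive data analysis much easier."
tokens = nltk.word_tokenize(text)

# drop punctuation and stopwords, then lemmatize
stop_words = set(stopwords.words('english'))
words = [w for w in tokens if w.isalpha() and w.lower() not in stop_words]
lemmatizer = WordNetLemmatizer()
lemmas = [lemmatizer.lemmatize(w.lower()) for w in words]
print(lemmas)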

13. Time Series Analysis

● Import statsmodels: import statsmodels.api as sm


● Create time series object: ts = pd.Series(data,
index=pd.date_range(start='2023-01-01', periods=len(data)))
● Resample time series: ts_monthly = ts.resample('M').mean()
● Calculate rolling mean: rolling_mean = ts.rolling(window=7).mean()
● Calculate exponential moving average: ema = ts.ewm(span=7).mean()
● Perform seasonal decomposition: result = sm.tsa.seasonal_decompose(ts)

● Check stationarity: from statsmodels.tsa.stattools import adfuller;
result = adfuller(ts)
● Make time series stationary: ts_diff = ts.diff().dropna()
● Create ACF plot: from statsmodels.graphics.tsaplots import plot_acf;
plot_acf(ts)
● Create PACF plot: from statsmodels.graphics.tsaplots import plot_pacf;
plot_pacf(ts)
● Fit ARIMA model: model = sm.tsa.ARIMA(ts, order=(1,1,1)).fit()
● Make ARIMA predictions: predictions = model.forecast(steps=5)
● Fit SARIMA model: model = sm.tsa.SARIMAX(ts, order=(1,1,1),
seasonal_order=(1,1,1,12)).fit()
● Perform Granger causality test: from statsmodels.tsa.stattools import
grangercausalitytests; grangercausalitytests(data[['y', 'x']], maxlag=5)
● Create Prophet model (df must have 'ds' and 'y' columns; newer releases install as prophet rather than fbprophet): from prophet import Prophet; model = Prophet().fit(df)
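
A compact sketch on a synthetic daily series (the dates and values are made up):

import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.tsa.stattools import adfuller

# synthetic daily series: slow trend plus noise
idx = pd.date_range(start='2023-01-01', periods=365, freq='D')
ts = pd.Series(np.arange(365) * 0.1 + np.random.randn(365), index=idx)

ts_monthly = ts.resample('M').mean()          # monthly averages
rolling_mean = ts.rolling(window=7).mean()    # 7-day rolling mean

adf_stat, p_value = adfuller(ts)[:2]          # stationarity check
model = sm.tsa.ARIMA(ts, order=(1, 1, 1)).fit()
forecast = model.forecast(steps=5)
print(p_value, forecast)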

By: Waleed Mousa
