PSOSM Tutorial Notes
Tutorial 1
Introduction
This tutorial covers how to install Ubuntu Linux in VirtualBox on Windows, Mac, or Linux.
Ubuntu is a popular Linux distribution that is easy to use for beginners. VirtualBox allows you to
install Ubuntu in a virtual machine so you don't have to disturb your main operating system.
First, download the Ubuntu ISO image from the Ubuntu website. The recommended version is the latest Long Term Support (LTS) release; LTS releases are supported for five years.
Also download and install VirtualBox for your operating system. VirtualBox allows you to create
and run virtual machines.
Open VirtualBox and click "New" to create a new virtual machine. Give it a name like "Ubuntu
Test", select Linux and Ubuntu 64-bit.
Allocate enough RAM, ideally 2-3 GB if you have it. Create a new virtual hard disk of at least 10
GB.
Installing Ubuntu
In VirtualBox, select the Ubuntu virtual machine and click Start. When prompted, select the
Ubuntu ISO file you downloaded. This will boot the Ubuntu installer.
Follow the prompts to install Ubuntu. Choose "Normal Install" unless you have bandwidth
constraints. Partition the virtual hard drive carefully. Create a swap partition the same size as
the RAM and an "ext4" root partition with the remaining space.
The install takes 30-40 minutes. You will set a username and password, which you will need to log in after installation.
Finishing Up
Once installed, Ubuntu will reboot. Log in with the username and password you set.
You now have Ubuntu running in a virtual machine inside your existing OS! You can use the
terminal and install additional software like Python for programming tasks.
Git and GitHub
Without Git, you would need to make a full copy of your codebase each time you wanted a snapshot, which takes up a massive amount of storage space. Git stores only the diffs (changes) between commits, saving storage space.
GitHub provides remote hosting for Git repositories. This allows you to share your code with
others and collaborate. Here's how to connect your local repo to a GitHub remote:
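A typical sequence might look like the following; the repository URL and branch name are placeholders for your own:

    git remote add origin https://github.com/<username>/<repo>.git
    git push -u origin main    # use "master" if that is your repo's default branch

After this, later commits can be pushed with a plain git push.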
GitHub provides a great web UI for viewing commit history, diffs, collaborating with others, and
more.
Summary
• Git is a powerful version control system that allows tracking code history, branching,
collaboration and more
• Basic commands: init, add, commit, status, log (see the example after this list)
• GitHub provides hosting for remote repositories to enable collaboration
• Following a basic Git workflow of adding/committing changes helps manage code
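A minimal sketch of that workflow on a new project (the file set and commit message are illustrative):

    git init                        # create a new repository in the current folder
    git add .                       # stage all changed files
    git commit -m "First commit"    # record a snapshot of the staged changes
    git status                      # show what has changed since the last commit
    git log                         # list the commit history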
Tutorial 2
Week 2 Reddit Tutorial: Data Collection Overview
Introduction to Reddit
Reddit is a social media platform where users interact through comments and upvotes on
posts. It has a community feel with subgroups called subreddits focused on specific topics. For
example, r/olympics is about the Olympics. Users can search for keywords like "India" to find
relevant subreddits, posts, communities and users.
Posts have titles, text bodies, images or videos. They accrue upvotes and comments.
Comments allow layered conversations as users can reply to other comments.
Subreddits have rules, moderators, flairs (like hashtags), and related communities.
To collect Reddit data, first create a Reddit account and app. The app will provide
authentication credentials like a client ID and secret.
Use the Python library PRAW to connect to Reddit with the credentials. The subreddit method
specifies a particular subreddit to pull data from. For example, subreddit("india") collects
posts from r/india.
PRAW returns post data like title, score, ID, URL, number of comments, created time, and body
text. Store this in a Pandas dataframe for analysis.
Save collected data as CSV files to access later. The PRAW documentation explains available
data fields like comments, usernames, post metadata etc.
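A minimal sketch of this collection flow, assuming placeholder credentials and the r/india example from above (the limit, user agent, and output file name are arbitrary choices):

    import praw
    import pandas as pd

    # Credentials come from the Reddit app created earlier (placeholders here)
    reddit = praw.Reddit(client_id="YOUR_CLIENT_ID",
                         client_secret="YOUR_CLIENT_SECRET",
                         user_agent="psosm-tutorial-script")

    rows = []
    for post in reddit.subreddit("india").hot(limit=100):
        rows.append({"title": post.title,
                     "score": post.score,
                     "id": post.id,
                     "url": post.url,
                     "num_comments": post.num_comments,
                     "created": post.created_utc,
                     "body": post.selftext})

    # Store in a dataframe and save as CSV for later analysis
    df = pd.DataFrame(rows)
    df.to_csv("india_posts.csv", index=False)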
Key Takeaways
Tutorial 3
Introduction
This tutorial covers how to collect Twitter data from the Twitter API and store it in a MySQL
database. MySQL is an open-source relational database management system that uses SQL.
Installing MySQL
The first step is installing MySQL on your system using the sudo apt-get install mysql-
server command. You will be prompted to set a root password during installation.
To connect Python to MySQL, the MySQLdb module needs to be installed using sudo apt-get install python-mysqldb.
This completes the installation and integration of MySQL with the Python environment.
A database called osn_data is created to store the tweets. Within this database, a table
called tweets is created with tweet_id as the primary key and a text field to store the tweet
content.
The Python script that collects real-time tweets from the Twitter API is modified to insert the
tweets into the MySQL database.
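A minimal sketch of that insertion step, assuming the MySQLdb module and placeholder credentials; in the real script the id and text would come from each incoming tweet in the stream:

    import MySQLdb

    # Connect to the osn_data database (credentials are placeholders)
    db = MySQLdb.connect(host="localhost", user="root",
                         passwd="YOUR_PASSWORD", db="osn_data")
    cursor = db.cursor()

    # Create the tweets table if it does not exist yet
    cursor.execute("CREATE TABLE IF NOT EXISTS tweets "
                   "(tweet_id BIGINT PRIMARY KEY, text TEXT)")

    # Example values; the streaming callback would supply these per tweet
    tweet_id = 1234567890
    tweet_text = "example tweet text"
    cursor.execute("INSERT INTO tweets (tweet_id, text) VALUES (%s, %s)",
                   (tweet_id, tweet_text))
    db.commit()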
After running the modified script for some time, the MySQL table contains multiple rows of
tweets.
The SELECT * FROM tweets query is used to retrieve the stored tweets. Other queries
like SELECT COUNT(*) FROM tweets can get the number of rows in the table.
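Run from Python, the same queries might look like this, reusing the cursor from the sketch above:

    cursor.execute("SELECT COUNT(*) FROM tweets")
    print(cursor.fetchone()[0])          # number of stored tweets

    cursor.execute("SELECT * FROM tweets")
    for tweet_id, text in cursor.fetchall():
        print(tweet_id, text)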
This allows effective storage and querying of streaming Twitter data in a MySQL database. The queries can be integrated into the Python script for better data collection and analysis.
Tutorial 4
Introduction
Social media platforms like Facebook and Twitter generate large amounts of data that can be
analyzed as network graphs. Network graphs represent entities like users, pages, or groups as
nodes, and relationships between them as edges. In this post, we will learn how to collect and
visualize Twitter data as a network graph using tools like Twecoll and Gephi.
Representing Twitter Data as a Network Graph
There are different ways to represent a node-edge graph, including adjacency matrices,
GraphML format, and CSV files.
An adjacency matrix is a 2D square matrix with dimensions equal to the number of nodes in
the graph. A 1 in cell (i,j) indicates an edge exists between node i and node j. This is simple to
construct using arrays in a programming language, but can take up a lot of space for sparse
graphs.
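A small sketch of building an adjacency matrix from an edge list, using a made-up undirected graph of four nodes:

    # Hypothetical edge list for nodes numbered 0..3
    edges = [(0, 1), (0, 2), (1, 3)]
    n = 4

    # n x n matrix of zeros; a 1 in cell (i, j) marks an edge between i and j
    adj = [[0] * n for _ in range(n)]
    for i, j in edges:
        adj[i][j] = 1
        adj[j][i] = 1    # mirror the entry since the graph is undirected

    for row in adj:
        print(row)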
GraphML is an XML-based format containing node and edge elements in sequence. Each node
element must have a unique ID attribute and each edge element has source and target
attributes identifying the endpoint nodes of an edge.
To collect Twitter data, we can use the Twecoll command line tool. It gathers information about
your Twitter followers and "friends of friends." Running twecoll from the command line prompts you to authorize a Twitter application and enter your credentials; it then retrieves follower IDs. The twecoll edgelist command adds edges between you, your followers, and their followers.
Gephi is an open-source tool for graph visualization. After installing it, we can open the
GraphML file produced by Twecoll. Gephi displays summary statistics about the graph and can
generate various network metrics like degree distribution, shortest paths, and modularity to
find communities.
The graph layout can be customized by changing color, size, and shape of nodes and edges.
We set the node color based on modularity class to see community structure and size by
degree to highlight key nodes. The edge curvature can also be adjusted to improve readability.
The final graph can be exported as an image or PDF file.
Key Terms
• Degree - The number of edges incident to a node. Related measures are in-degree and out-degree.
• Data Laboratory - The Gephi view that shows nodes, edges, and their attributes as originally stored in the GraphML file.
Conclusion
This summarizes the key steps covered in the video to collect Twitter data, represent it as a
network graph, analyze graph metrics using Gephi, and customize the graph visualization.
Tutorial 5
Overview
This video explains how to use the Natural Language Toolkit (NLTK) library in Python to
analyze text data collected from Twitter. The goal is to clean and process the text to get
insights into what the tweets are about.
The video starts by reading in 152 tweets collected about the Serum Institute. To analyze the
tweets, the first step is to break them into individual words using NLTK's word tokenizer. This
converts each tweet into a list of word tokens.
All the tokens from all the tweets are added to one large list. This list is passed to the Counter
class from the Python collections module to get a count of each unique word.
The most common words include some useful terms like "Serum" and "India" but also lots of
noise like punctuation. To clean this up:
• All text is lowercased so "Serum" and "serum" map to the same word
• Punctuation is removed using Python's string translate() method
• Stopwords like "the", "of", "at" are removed using NLTK's stopwords corpus
• Very short tokens (two characters or fewer) are removed to delete residue like "rt"
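A compact sketch of this pipeline, assuming the tweets are already loaded as a list of strings; the two placeholder tweets and the exact length threshold are illustrative, and the steps in the video may differ slightly:

    import string
    from collections import Counter

    import nltk
    from nltk.corpus import stopwords
    from nltk.tokenize import word_tokenize

    nltk.download("punkt")        # tokenizer model
    nltk.download("stopwords")    # stopword lists

    # Placeholder tweets; in the tutorial these are the 152 collected tweets
    tweets = ["RT Fire at the Serum Institute facility in Pune...",
              "Serum Institute says vaccine production is safe..."]

    stop_words = set(stopwords.words("english"))
    table = str.maketrans("", "", string.punctuation)    # strips punctuation

    tokens = []
    for tweet in tweets:
        for token in word_tokenize(tweet.lower()):          # lowercase, then tokenize
            token = token.translate(table)                  # remove punctuation
            if len(token) > 2 and token not in stop_words:  # drop stopwords and short leftovers like "rt"
                tokens.append(token)

    print(Counter(tokens).most_common(10))                  # most frequent cleaned words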
After this cleanup, the most common words provide a much clearer signal on what the tweets
are about. Words like "fire", "vaccine", "Pune", "lives lost" indicate a fire at the Serum Institute
facility in Pune that resulted in loss of life.
Key Takeaways
• NLTK provides useful text processing capabilities like tokenization and stopword lists
• Cleaning the text by lowercasing, removing punctuation and stopwords, etc. greatly improves analysis
• Even simple cleaning can reveal insights and topics within noisy text data like tweets
Tutorial 6
Introduction to Gephi
Gephi is an open-source network analysis and visualization software package used for research projects in education, journalism, digital humanities, and other fields. It can import social network data from platforms such as Facebook and Twitter and generate graphs and clusters.
Loading Data into Gephi
• Gephi can read many file formats, such as GML, GraphML, Pajek NET, and UCINET DL files.
• To import from CSV, two files are needed: one with the node list and one with the edge list (see the example after this list).
• The node CSV should have a column with a unique node id.
• The edge list CSV should have source and target columns with node ids for each edge. It can also include a column indicating the edge type: directed or undirected.
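As an illustration, a minimal pair of import files might look like this (the column names follow Gephi's CSV conventions; the values are made up):

    nodes.csv
    Id,Label
    1,Alice
    2,Bob
    3,Carol

    edges.csv
    Source,Target,Type
    1,2,Undirected
    2,3,Undirected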
Overview Tab
• Appearance tab: Change node/edge color and size based on categorical and continuous attributes.
• Layout tab: Choose from different layout algorithms like Fruchterman-Reingold and customize them.
• Filters tab: Filter nodes and edges based on attributes like degree, followers etc.
• Statistics tab: Compute network metrics like average degree, diameter, modularity etc.
Preview Tab
The Preview tab shows the generated graph with the final visualization options applied. New layouts and filters can be tested without affecting the existing workspace.
Data Laboratory
The Data Laboratory shows node and edge data as tables. Columns can be edited directly.
Specific nodes/edges can be selected for analysis.
Customizing Visualization
• Change node color, size based on attributes like followers, degree etc.
• Modify edge color, thickness based on weight.
• Adjust labels, fonts, background color.
• Apply different layout algorithms like Fruchterman-Reingold.
• Filter nodes by criteria like followers between a range.
• Create subsets as new workspaces using filters.
Key Features
Tutorial 7
The video introduces how to create interactive data visualizations using Python scripts and Highcharts, a JavaScript charting library. It demonstrates creating four types of charts from sample data, as sketched after the list below:
Bar Chart
Line Chart
Scatter Plot
Bubble Chart
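A minimal sketch of the Python-plus-Highcharts approach: the script below writes a small HTML page that loads the Highcharts library and renders a bar chart from sample data (the data values, chart options, and output file name are placeholders):

    import json

    # Placeholder sample data and chart options
    options = {
        "chart": {"type": "bar"},
        "title": {"text": "Sample bar chart"},
        "xAxis": {"categories": ["A", "B", "C"]},
        "series": [{"name": "Count", "data": [5, 3, 8]}],
    }

    # Embed the options as JSON inside a minimal HTML page
    html = """<html>
    <head><script src="https://code.highcharts.com/highcharts.js"></script></head>
    <body>
    <div id="container"></div>
    <script>Highcharts.chart('container', %s);</script>
    </body>
    </html>""" % json.dumps(options)

    with open("bar_chart.html", "w") as f:
        f.write(html)    # open bar_chart.html in a browser to view the interactive chart

Changing "type" to "line" or "scatter" produces the corresponding chart; bubble charts additionally need the highcharts-more.js module.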
Highcharts Cloud is a platform that visualizes data instantly when you paste it in and auto-generates charts. Benefits:
• Don't need to write any code
• Customizable charts
• Interactive features like filtering, tooltips
• Can save and download charts
Key Takeaways