Showing posts with label histogram. Show all posts

Friday, June 4, 2021

The Central Limit Theorem, a hands-on introduction

The central limit theorem can be informally summarized in a few words: the sum of x1, x2, ... xn samples drawn from the same distribution is normally distributed, provided that n is big enough and that the distribution has finite variance. To show this experimentally, let's define a function that computes the sum of n samples from the same distribution 100000 times:
import numpy as np
import scipy.stats as sps
import matplotlib.pyplot as plt

def sum_random_variables(*args, sp_distribution, n):
    # returns an array of 100000 sums, each obtained by
    # adding n random samples drawn from sp_distribution
    v = [sp_distribution.rvs(*args, size=100000) for _ in range(n)]
    return np.sum(v, axis=0)
This function takes as input the parameters of the distribution, the scipy object that implements the distribution, and n. It returns an array of 100000 elements, where each element is the sum of n samples. Given the Central Limit Theorem, we expect the output values to be normally distributed if n is big enough. To verify this, let's consider a beta distribution with parameters alpha=1 and beta=2, run our function with increasing values of n, and plot the histogram of the output values:
plt.figure(figsize=(9, 3))
N = 5
for n in range(1, N):
    plt.subplot(1, N-1, n)
    s = sum_random_variables(1, 2, sp_distribution=sps.beta, n=n)
    plt.hist(s, density=True)
plt.tight_layout()
On the far left we have the histogram for n=1, the one for n=2 right next to it, and so on up to n=4. With n=1 we have the original distribution, which is heavily skewed. With n=2 the distribution is less skewed. When we reach n=4 the distribution is almost symmetrical, resembling a normal distribution.
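The theorem also tells us which Gaussian the sums approach: one with mean n*mu and variance n*sigma^2, where mu and sigma^2 are the mean and variance of the original distribution. As a quick sketch (not part of the original snippet, reusing the imports and the function above), we can overlay this Gaussian on the histogram for n=4:
plt.figure()
n = 4
# mean and variance of the original beta(1, 2) distribution
mu, var = sps.beta.stats(1, 2, moments='mv')
s = sum_random_variables(1, 2, sp_distribution=sps.beta, n=n)
plt.hist(s, density=True, alpha=.5)
x = np.linspace(s.min(), s.max(), 200)
# Gaussian predicted by the theorem: mean n*mu, variance n*var
plt.plot(x, sps.norm.pdf(x, loc=n*mu, scale=np.sqrt(n*var)))
plt.show()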

Let's do the same experiment using a uniform distribution:
plt.figure(figsize=(9, 3))
for n in range(1, N):
    plt.subplot(1, N-1, n)
    s = sum_random_variables(1, 1, sp_distribution=sps.beta, n=n)
    plt.hist(s, density=True)
plt.tight_layout()
Here, already for n=2 the distribution is symmetrical, resembling a triangle, and increasing n further we get closer and closer to the shape of a Gaussian.
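Note that a beta distribution with alpha=1 and beta=1, as used in the code above, is exactly the uniform distribution on [0, 1]. The same experiment can be written more explicitly with scipy's uniform distribution (a small sketch, reusing the function above):
s = sum_random_variables(sp_distribution=sps.uniform, n=2)
plt.hist(s, density=True)
plt.show()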

The same behaviour can be shown for discrete distributions. Here's what happens if we use the Bernoulli distribution:
plt.figure(figsize=(9, 3))
for n in range(1, N):
    plt.subplot(1, N-1, n)
    s = sum_random_variables(.5, sp_distribution=sps.bernoulli, n=n)
    plt.hist(s, bins=n+1, density=True, rwidth=.7)
plt.tight_layout()
We see again that for n=2 the distribution starts to be symmetrical and that the shape of a Gaussian is almost clear for n=4.
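In this case we can actually say more than the theorem does: the sum of n Bernoulli samples follows, by definition, a binomial distribution, so the empirical frequencies can be checked against the exact binomial probabilities. A quick sketch (not in the original post, reusing the function above):
n, p = 4, .5
s = sum_random_variables(p, sp_distribution=sps.bernoulli, n=n)
k = np.arange(n + 1)
print(np.bincount(s, minlength=n + 1) / len(s)) # observed frequencies
print(sps.binom.pmf(k, n, p))                   # exact probabilities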

Sunday, June 16, 2013

2D Histograms with Plotly

Plotly is an online tool that lets us create wonderful interactive visualizations of our data. It can plot data from csv files, spreadsheets, etc., but it also has a Python sandbox where we can put our Python snippets! In this post we will see a simple example that shows how to plot a 2D histogram in Plotly.

First, we need a snippet to generate some random sets of data:
from numpy import *
 
# generate some random sets of data
y0 = random.randn(100)/5. + 0.5 
x0 = random.randn(100)/5. + 0.5 
 
y1 = random.rayleigh(size=20)/7. + 0.1
x1 = random.rayleigh(size=20)/8. + 1.1
 
y2 = random.randn(50)/10. + 0.9
x2 = random.rayleigh(size=50)/10. + 0.1
 
y3 = random.randn(50)/8. + 0.1
x3 = random.randn(50)/8. + 0.1
 
y = concatenate([y0,y1,y2,y3])
x = concatenate([x0,x1,x2,x3])

The distribution of the variable x looks like:

The distribution of the variable y looks like:

And the 2D histogram of both variables looks like this:
As shown in the colorbar, cells with lighter colors correspond to high-density areas of our distribution.

All the plots above were made with Plotly inside their Python sandbox using the following code:
## place the data into Plotly's dict format

# histograms
histx = {'x': x, 'type':'histogramx'}
histy = {'y': y, 'type':'histogramy'}
hist2d = {'x': x, 'y': y, 'type':'histogram2d'}

# scatter plots above the 1D histograms
# "jitter" the scatter plot points to make their distribution easier to distinguish
jitterx = {'x': x, 'y': 60+3*random.rand(len(x)), 'type':'scatter','mode':'markers','marker':{'size':4,'opacity':0.5,'symbol':'square'}}

jittery = {'x': y, 'y': 35+3*random.rand(len(y)), 'type':'scatter','mode':'markers','marker':{'size':4,'opacity':0.5,'symbol':'square'}}

# scatter points in the 2D histogram
xy = {'x': x, 'y': y, 'type':'scatter','mode':'markers','marker':{'size':5,'opacity':0.5,'symbol':'square'}}

# NOTE: the following lines plot all the graphs above
plot([histx, jitterx], layout={'title': 'Distribution of Variable 1'})
plot([histy, jittery], layout={'title': 'Distribution of Variable 2'})
plot([hist2d,xy], layout={'title': 'Distribution of Variable 1 and Variable 2'})
Plots made with Plotly automatically provide interactions (click-drag to zoom, double-click to autoscale, shift-click to pan) and are very easy to embed in a web page using the embedding snippet.
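For readers trying this today: the sandbox API used above has since been deprecated. With the current plotly Python package, a rough equivalent of the 2D histogram with the overlaid scatter points could be sketched as follows (assuming plotly is installed):
import plotly.graph_objects as go

fig = go.Figure()
fig.add_trace(go.Histogram2d(x=x, y=y))
fig.add_trace(go.Scatter(x=x, y=y, mode='markers',
                         marker=dict(size=5, opacity=0.5, symbol='square')))
fig.update_layout(title='Distribution of Variable 1 and Variable 2')
fig.show()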

Thanks to the Plotly guys for providing the code of this post and this amazing tool :)

Monday, April 1, 2013

Real-time Twitter analysis

The Twitter API is a great tool for analyzing tweets programmatically. In particular, the streaming API gives real-time access to the global stream of tweets and, unlike a conventional REST API, it is used through a continuous connection to the Twitter servers. So it requires a persistent HTTP connection that stays open as long as you need to collect tweets. The typical workflow of an application which uses this API is the following:


The easiest way to handle HTTP streaming in Python is to use PyCurl, the Python bindings for the famous Curl network library. PyCurl allows you to provide a callback function that will be executed every time a new block of data is available. The following code is a simple demonstration of how we can use HTTP streaming with PyCurl in order to analyze a stream of tweets:
from __future__ import division
from collections import defaultdict
from pylab import barh,show,yticks
import pycurl
import simplejson
import sys
import nltk
import re

def plot_histogram(freq, mean):
 # using dict comprehensions to remove infrequent words
 topwords = {word : count 
             for word,count in freq.items() 
             if count > round(2*mean)}
 # plotting
 y = topwords.values()
 x = range(len(y))
 labels = topwords.keys()
 barh(x,y,align='center')
 yticks(x,labels)
 show()

class TwitterAnalyzer:
 def __init__(self):
  self.freq = defaultdict(int)
  self.cnt = 0
  self.mean = 0.0
  # composing the twitter stream url
  nyc_area = 'locations=-74,40,-73,41'
  self.url = "https://stream.twitter.com/1.1/statuses/filter.json?"+nyc_area

 def begin(self,usr,pws):
  """ 
    init and start the connection with twitter using pycurl
    usr and pws must be valid twitter credentials
  """
  self.conn = pycurl.Curl()
  # we use basic authentication,
  # in the future oauth2 could be required
  self.conn.setopt(pycurl.USERPWD, "%s:%s" % (usr, pws))
  self.conn.setopt(pycurl.URL, self.url)
  self.conn.setopt(pycurl.WRITEFUNCTION, self.on_receive)
  self.conn.perform()

 def on_receive(self,data):
  """ Handles the arrive of a single tweet """
  self.cnt += 1
  tweet = simplejson.loads(data)
  # a little bit of natural language processing
  tokens = nltk.word_tokenize(tweet['text']) # tokenize
  tagged_sent = nltk.pos_tag(tokens) # Part Of Speech tagging
  for word,tag in tagged_sent:
    # filter single-char words and symbols
   if len(word) > 1 and re.match('[A-Za-z0-9 ]+',word):
    # consider only adjectives and nouns
    if tag == 'JJ' or tag == 'NN':
     self.freq[word] += 1 # keep the count
  # print the statistics every 50 tweets
  if self.cnt % 50 == 0:
   self.print_stats()

 def print_stats(self):
  maxc = 0
  sumc = 0
  for word,count in self.freq.items():
   if maxc < count:
    maxc = count
   sumc += count
  self.mean = sumc/len(self.freq)
  print '-------------------------------'
  print ' tweets analyzed:', self.cnt
  print ' words extracted:', len(self.freq)
  print '   max frequency:', maxc
  print '  mean frequency:', self.mean

 def close_and_plot(self,signal,frame):
  print ' Plotting...'  
  plot_histogram(self.freq,self.mean)
  sys.exit(0)
In the constructor of this class we initialize a dictionary that will contain the frequency of each word, a string that contains the url of the service we need to call (composed so as to query Twitter for the tweets in the NYC area), and the variables cnt and mean to keep track of the number of tweets analyzed and of the mean frequency over all the words.
In the method begin, we use the PyCurl library to authenticate with Twitter and start the connection. In particular, we register the method on_receive as the callback function in charge of processing the incoming data. In this method the actual analysis is done: every tweet is split into tokens and part-of-speech tagging is performed. Then, the frequency of all the words that are adjectives or nouns is updated.
The method print_stats is used to print our statistics on the console, while close_and_plot plots a histogram using the frequencies in the dictionary and closes the program.
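To get a feel for what the tagging step produces, here is a tiny standalone sketch (it assumes the required nltk models have already been downloaded with nltk.download):
import nltk
tokens = nltk.word_tokenize('Good morning New York')
# pos_tag returns a list of (word, tag) pairs, where the tag
# 'JJ' marks adjectives and 'NN' marks singular nouns
print nltk.pos_tag(tokens)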
Let's use this class:
import signal
usr = 'supersexytwitteruser'
pws = "yessosexyiam"

ta = TwitterAnalyzer()
# invoke the close_and_plot() method when a Ctrl-C arrives
signal.signal(signal.SIGINT, ta.close_and_plot)
ta.begin(usr,pws) # run the analysis
In the code above, a TwitterAnalyzer object is instantiated, its method close_and_plot is registered as handler for the SIGINT signal and, finally, begin is invoked.
This code starts a program which analyzes all the tweets of the New York area in real time and prints the statistics every 50 tweets, as follows:
-------------------------------
 tweets analyzed: 50
 words extracted: 110
   max frequency: 8
  mean frequency: 1.12727272727
-------------------------------
 tweets analyzed: 100
 words extracted: 200
   max frequency: 22
  mean frequency: 1.235
-------------------------------
 tweets analyzed: 150
 words extracted: 286
   max frequency: 29
  mean frequency: 1.26573426573
-------------------------------
 tweets analyzed: 200
 words extracted: 407
   max frequency: 39
  mean frequency: 1.31203931204
-------------------------------
 tweets analyzed: 250
 words extracted: 495
   max frequency: 49
  mean frequency: 1.37575757576
Pressing Ctrl-C we can stop the program and plot a bar chart of the adjectives and nouns detected. This is what I got on the morning of March 21, 2013:


We see that it is very common to post a link in a tweet (it turns out that http is tagged as a noun most of the time) and that the words day, today, good and morning were the most used during the analysis.

Monday, January 14, 2013

Box-Muller Transformation

The Box-Muller transform is a method for generating normally distributed random numbers from uniformly distributed random numbers. The Box-Muller transformation can be summarized as follows: suppose u1 and u2 are independent random variables that are uniformly distributed between 0 and 1 and let
z1 = sqrt(-2 log(u1)) * cos(2 pi u2)
z2 = sqrt(-2 log(u1)) * sin(2 pi u2)
then z1 and z2 are independent random variables with a standard normal distribution. Intuitively, the transformation maps each circle of points around the origin to another circle of points around the origin where larger outer circles are mapped to closely-spaced inner circles and inner circles to outer circles.
Let's see a Python snippet that implements the transformation:
from numpy import random, sqrt, log, sin, cos, pi
from pylab import show,hist,subplot,figure

# transformation function
def gaussian(u1,u2):
  z1 = sqrt(-2*log(u1))*cos(2*pi*u2)
  z2 = sqrt(-2*log(u1))*sin(2*pi*u2)
  return z1,z2

# uniformly distributed values between 0 and 1
u1 = random.rand(1000)
u2 = random.rand(1000)

# run the transformation
z1,z2 = gaussian(u1,u2)

# plotting the values before and after the transformation
figure()
subplot(221) # the first row of graphs
hist(u1)     # contains the histograms of u1 and u2 
subplot(222)
hist(u2)
subplot(223) # the second contains
hist(z1)     # the histograms of z1 and z2
subplot(224)
hist(z2)
show()
The result should be similar to the following:


In the first row of the graph we can see, respectively, the histograms of u1 and u2 before the transformation and in the second row we can see the values after the transformation, respectively z1 and z2. We can observe that the values before the transformation are distributed uniformly while the histograms of the values after the transformation have the typical Gaussian shape.
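As a further sanity check (not in the original snippet), the transformed samples should have mean close to 0 and standard deviation close to 1:
print 'z1 mean and std:', z1.mean(), z1.std()
print 'z2 mean and std:', z2.mean(), z2.std()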

Tuesday, July 12, 2011

Dice rolling experiment

If we roll a die a large number of times and compute the mean and variance, as explained here, we'd expect to obtain a mean = 3.5 and a variance = 2.916. Let's simulate that with a script:
import pylab
import math
 
# Rolling the die 1000 times
v = pylab.randint(1, 7, size=1000)

print 'mean',pylab.mean(v)
print 'variance',pylab.var(v)

pylab.hist(v, bins=6) # histogram of the outcomes
pylab.xlim(1,6)
pylab.show()
Here's the result:
mean 3.435
variance 2.781775
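The simulated values are close to the theoretical ones. As a quick check (not in the original post), the theoretical moments can be computed directly from the six faces:
import pylab
# moments of a fair six-sided die
faces = pylab.arange(1, 7)
print 'theoretical mean', faces.mean()    # (1+2+...+6)/6 = 3.5
print 'theoretical variance', faces.var() # 35/12, about 2.9167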