Scrape and Analyze Celebrity Dating Data

Being able to scrape and analyze data gives a computer science student a sense of freedom they may not have experienced before. Much of the world's data lives on web pages, and people make important decisions based on it: think finance, politics, culture. This data can also reveal the history of pretty much anything you are interested in. Imagine mastering the art of scraping any data you care about, building an AI pipeline to analyze it, and creating (maybe even commercial) applications that act in the real world on what you find. So let's dig in!

I am interested in testing the myth that "people tend to date outside of their zodiac element". I am also interested in analyzing the graph of daters: do certain graph motifs show up among them? The website whosdatedwho.com has a (reportedly) complete database of celebrity relationships and is a perfect data source for this curiosity.
[Screenshot: a celebrity profile page on whosdatedwho.com]
Note that the site has rich relationship and personal information, so you can pose a variety of questions before even scraping the data, such as which film genre's actors and actresses tend to date the most.
The canonical tools for scraping the web here are two Python libraries (BeautifulSoup and Requests) plus the Google Chrome browser for inspecting pages. Their documentation is approachable enough that you can start writing code after skimming the intros.
One tricky thing about the website is that although celebrity info is organized by name, each listing page uses infinite scroll to load its entries. We don't want to simulate a browser with Selenium to get the data, because that is really slow. Instead we will work around the site and find out how it actually serves the data. It turns out you can often turn a dynamic, infinitely scrolling page into a static one by watching the XMLHttpRequest (XHR) traffic in the browser's developer tools while the page loads more data, and then calling that endpoint directly, as illustrated here.
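For example, watching the Network tab (filtered to XHR) in Chrome DevTools while a listing page loads more entries reveals requests of the form below. A quick check, sketched here under the assumption that the endpoint still responds this way, shows it returns a plain HTML fragment that can be parsed without a browser:

import requests

# The paginated fragment endpoint observed in DevTools; letter and page are ordinary query parameters.
url = 'http://www.whosdatedwho.com/popular?letter=a&page=1&_block=data.browseGrid'
resp = requests.get(url)
print(resp.status_code)    # 200 if the fragment endpoint is reachable
print(resp.text[:300])     # a static HTML fragment; no JavaScript execution needed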

With this trick, you can enumerate all the profile urls on the site simply by incrementing the page number, and then scrape each profile page. Scraping is done in two stages: collecting all the urls, then retrieving each person's information. Below is the Python 3.6 code for the first stage, which gathers all celebrity urls and stores them in a pickled dict:

from bs4 import BeautifulSoup
import requests
import re
import string
from collections import defaultdict
from tqdm import tqdm

# Paginated XHR endpoint found via Chrome DevTools; letter and page get filled in below.
base = ('http://www.whosdatedwho.com/popular'
        '?letter={}&page={}&_block=data.browseGrid')
all_person_list = defaultdict(dict)

for l in tqdm(list(string.ascii_lowercase)):
    for n in tqdm(range(1,1000)):
        c = requests.get(base.format(l, n)).content
        soup = BeautifulSoup(c, 'html.parser')
        person_list = list(
            map(lambda x: x.a['href'],
            soup.find_all("li", {"class":"ff-grid-box ff-list"}))
        )
        if len(person_list) == 0:
            break  # an empty page means we have paged past the last results for this letter
        else:
            for p in person_list:
                # the last path segment of the profile url serves as a unique short name
                p_name = re.findall(r'[^/]+(?=/$|$)', p)[0]
                all_person_list[p_name]['url'] = p

import pickle
with open('all-person-list.pickle','wb') as handle:
    pickle.dump(dict(all_person_list), handle, protocol=pickle.HIGHEST_PROTOCOL)
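
A quick sanity check on the first stage's output (a sketch; the exact count depends on when you scrape):

import pickle

with open('all-person-list.pickle', 'rb') as handle:
    all_person = pickle.load(handle)

print(len(all_person))                 # number of distinct celebrity profiles found
print(next(iter(all_person.items())))  # each entry maps a short name to {'url': profile_url}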

A dictionary keyed by person is a natural way to store the per-person data. Below is the code for the second stage, which scrapes each profile and stores the results in another pickled dict:
from bs4 import BeautifulSoup
import requests
from tqdm import tqdm
from collections import defaultdict

def get_person_info(l):
    c = requests.get(l).content
    soup = BeautifulSoup(c, 'html.parser')
    hist = soup.find_all('div',{'id':'ff-dating-history-table'})
    personal_info = {}
    relation_info = []
    headers = list(map(lambda x: x.text, soup.find_all('div',{'class':'header'})[:3]))
    facts = list(map(lambda x: x.text, soup.find_all('div',{'class':'fact'})[:3]))
    footers = list(map(lambda x: x.text, soup.find_all('div',{'class':'footer'})[:3]))
    for i, h in enumerate(headers):
        if h is not None:
            personal_info[h.lower()] = facts[i].strip()
            footer = footers[i].strip()
            personal_info['dead'] = 'death' in footer
            # Use the footer text as the value unless it is just an annotation
            # such as 'at death', 'years old' or 'total'.
            if ('at death' not in footer and 'years old' not in footer
                    and 'total' not in footer):
                personal_info[h.lower()] = footer
    if len(hist) > 0:
        table = hist[0].find('table')
        # the first row is the table header, so skip it
        for row in table.findAll('tr')[1:]:
            col = row.findAll('td')
            name = col[1].getText().strip()
            name_url = col[1].a['href']
            status = col[2].getText().strip()
            time_start = col[4].getText().strip()
            time_end = col[5].getText().strip()
            duration = col[6].getText().strip()
            relation_info.append({'name': name, 'name_url': name_url, 'status': status,
                                  'time_start': time_start, 'time_end': time_end,
                                  'duration': duration})
    return {'personal':personal_info, 'relation':relation_info}

import pickle
with open('all-person-list.pickle', 'rb') as handle:
    all_person = pickle.load(handle)

# One request per person; this is the slow part of the whole pipeline.
for p in tqdm(all_person):
    try:
        all_person[p]['info'] = get_person_info(all_person[p]['url'])
    except Exception:
        print('failed:', p)

with open('all-person-info.pickle','wb') as handle:
    pickle.dump(dict(all_person), handle, protocol=pickle.HIGHEST_PROTOCOL)
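
Before moving on, it helps to peek at one scraped record to see the structure the analysis below relies on (a quick sketch; it assumes the first record scraped successfully, and the keys present vary from page to page):

import pickle

with open('all-person-info.pickle', 'rb') as handle:
    info = pickle.load(handle)

name = next(iter(info))
print(info[name]['url'])                    # profile url collected in stage one
print(info[name]['info']['personal'])       # keys such as 'age', 'zodiac', 'dead' (when present)
print(info[name]['info']['relation'][:1])   # dicts with name, name_url, status, time_start, time_end, duration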

Note that the full scrape takes a while (around 10 hours). To save time if you are not patient enough to wait, here is the pickle file; you need Python 3 to load it. Now that the data is ready, let's analyze it. The general strategy for exploration is to clean the data into a tidy Pandas DataFrame and then freely explore that DataFrame until you find something interesting. Getting there takes some data wrangling, as below:
import pickle
with open('all-person-info.pickle', 'rb') as handle:
    info = pickle.load(handle)

import re
import numpy as np
import seaborn as sns
import dateparser
from collections import defaultdict
import networkx as nx
from tqdm import tqdm
from collections import Counter
import matplotlib.pyplot as plt
import pandas as pd

def str_to_yr(i):
    # Convert a duration string such as '3 years' or '8 months' into years.
    if 'year' in i:
        return float(re.findall(r'\d+', i)[0])
    elif 'month' in i:
        return float(re.findall(r'\d+', i)[0]) / 12
    else:
        return 'unknown'

def get_yr(i):
    # Pull the four-digit year out of a date string, if there is one.
    try:
        return re.findall(r'\d{4}', i)[0]
    except IndexError:
        return 'unknown'

def horo_type_f(i):
    temp = {'Aquarius':'air',
     'Aries':'fire',
     'Cancer':'water',
     'Capricorn':'earth',
     'Gemini':'air',
     'Leo':'fire',
     'Libra':'air',
     'Pisces':'water',
     'Sagittarius':'fire',
     'Scorpio':'water',
     'Taurus':'earth',
     'Virgo':'earth',
     'unknown':'unknown'}
    return temp[i]

def plotBar(l):
    zodiac_c = list(zip(*Counter(l).most_common()))
    plt.bar(range(len(zodiac_c[0])), zodiac_c[1], align='center')
    plt.xticks(range(len(zodiac_c[0])), zodiac_c[0],rotation='vertical')
    plt.show()
    return zodiac_c

zodiac = []
zodiac_types = []
age_l = []
dead_l = []
zodiac_couple = []
zodiac_type_couple = []
couple_duration = []
couple_start = []
num_rela = []
name_shorts = []
relation_shorts = []
G = nx.Graph()   # reserved for the (unfinished) graph analysis at the end of the post
i_temp = -1

for name_short in tqdm(info):
    if 'info' not in info[name_short]:
        continue   # skip anyone whose page failed to scrape
    name_shorts.append(name_short)
    i_temp += 1
    if 'zodiac' in info[name_short]['info']['personal']:
        horo = info[name_short]['info']['personal']['zodiac']
    else:
        horo = 'unknown'
    if 'age' in info[name_short]['info']['personal']:
        age = info[name_short]['info']['personal']['age']
    else:
        age = 'unknown'
    if 'dead' in info[name_short]['info']['personal']:
        dead = info[name_short]['info']['personal']['dead']
    else:
        dead = 'unknown'
    zodiac.append(horo)
    horo_type = horo_type_f(horo)
    zodiac_types.append(horo_type)
    age_l.append(age)
    dead_l.append(dead)
    num_rela.append(len(info[name_short]['info']['relation']))
    if len(info[name_short]['info']['relation']) > 0:
        for rela in info[name_short]['info']['relation']:
            name_short_other = re.findall(r'[^/]+(?=/$|$)',rela['name_url'])[0]
            relation_shorts.append('+'.join(sorted([name_short, name_short_other])))
            duration = str_to_yr(rela['duration'])
            start = get_yr(rela['time_start'])
            if name_short_other not in info:
                horo_other = 'unknown'
            else:
                if 'zodiac' in info[name_short_other]['info']['personal']:
                    horo_other = info[name_short_other]['info']['personal']['zodiac']
                else:
                    horo_other = 'unknown'
            horo_other_type = horo_type_f(horo_other)
            zodiac_couple.append(tuple(sorted([horo, horo_other])))
            zodiac_type_couple.append(tuple(sorted([horo_type, horo_other_type])))
            couple_duration.append(duration)
            couple_start.append(start)            

people_df = pd.DataFrame({'name': name_shorts,'zodiac':zodiac, 'zodiac_element':zodiac_types,
                          'age': age_l, 'num_rela':num_rela})
relation_df = pd.DataFrame({'couple': relation_shorts, 'start':couple_start, 'duration':couple_duration,
                           'zodiac':zodiac_couple, 'zodiac_element':zodiac_type_couple})

Now the fun part begins. Let's explore freely; this is what the data looks like:

[Screenshot: the first few rows of people_df and relation_df]
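If you are following along in a notebook, the same view comes from inspecting the two DataFrames directly:

print(people_df.head())      # one row per celebrity: name, zodiac, zodiac_element, age, num_rela
print(relation_df.head())    # one row per relationship: couple, start, duration, zodiac, zodiac_element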

This plots the relationship duration distribution:

temp = [i for i in relation_df.duration[relation_df.duration != 'unknown'].tolist() if i < 80]
sns.distplot(temp, axlabel='year')

[Plot: distribution of relationship durations in years]

A relationship lasting more than 50 years is hard to believe and is most likely a data error, but if we choose our measure of central tendency wisely and use the median, such outliers should not affect the analysis much.
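As a quick check on how much that tail drags the mean away from the median (a small sketch on the filtered durations):

durations = relation_df.duration[relation_df.duration != 'unknown'].astype(float)
print(durations.mean())      # pulled upward by the long tail and data-error outliers
print(durations.median())    # the robust summary used in the rest of the analysis
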
This plots the average age of people by zodiac element; the averages are pretty close to each other.
temp = people_df[(people_df.zodiac_element != 'unknown') & (people_df.age != 'unknown') & (people_df.age != 'year old')].copy()
temp['age'] = temp['age'].astype(int)
temp = temp.groupby(temp.zodiac_element)[['age']].mean()
temp.rename(index=str, columns={'age': 'mean_age'}).plot.bar()

[Plot: mean age by zodiac element]

This plots the median number of relationships for celebrities of each zodiac element, and the values are the same across elements. The median is used because of the long tail and outliers.
temp = people_df[(people_df.zodiac_element != 'unknown') & (people_df.age != 'unknown') & (people_df.age != 'year old')].copy()
temp['age'] = temp['age'].astype(int)

temp = temp.groupby(temp.zodiac_element)[['num_rela']].median()
temp.rename(index=str, columns={'num_rela': 'median_num_rela'}).plot.bar()

[Plot: median number of relationships by zodiac element]

Now for the most interesting part: does zodiac element affect dating behavior?
temp = relation_df[(relation_df.duration != 'unknown') & (relation_df.duration != 'year old')
           & (relation_df.zodiac_element.apply(lambda x: x[0]) != 'unknown')
           & (relation_df.zodiac_element.apply(lambda x: x[1]) != 'unknown')].copy()
temp['duration'] = temp.duration.astype(float)
# Raw count (in thousands) and median duration for every pair of zodiac elements.
temp1 = temp.groupby(temp.zodiac_element)[['duration']].count()
temp2 = temp.groupby(temp.zodiac_element)[['duration']].median()
temp1['duration'] = temp1['duration'] / 1000
temp1 = temp1.rename(index=str, columns={'duration': 'count (Thousands)'})
temp2 = temp2.rename(index=str, columns={'duration': 'median_duration'})
temp1.plot.bar()
temp2.plot.bar()

[Plot: relationship counts (thousands) for each pair of zodiac elements]

This plot shows the raw count of every combination of zodiac elements among relationships. The data shows a clear tendency for people to date outside of their own zodiac element. So how about the duration of these relationships?

[Plot: median relationship duration for each pair of zodiac elements]

Okay, the median duration is about the same for every combination of zodiac elements: roughly two years. Not bad.
It is truly strange that the data shows such a strong trend for something that is supposedly a myth. Did I do the math wrong? To check, I ran the random simulation below, which shows that if the myth were false, the raw counts of all combinations of zodiac elements should be pretty close to each other.
n = 10000
# Four zodiac elements with equal frequency, encoded 0..3, then shuffled into a random order.
l = [0] * n + [1] * n + [2] * n + [3] * n
np.random.shuffle(l)

from itertools import tee

def pairwise(iterable):
    "s -> (s0,s1), (s1,s2), (s2, s3), ..."
    a, b = tee(iterable)
    next(b, None)
    return zip(a, b)

# Neighbouring entries in the shuffled list act as randomly matched couples.
plotBar(list(pairwise(l)))

[Plot: simulated counts of element pairs under random matching]

See, all combinations have similar counts. It would be interesting to see whether the same trend exists on other dating platforms; if it persists, then, well, the myth would be true and we should all date outside of our zodiac element? 😂

Now let's analyze the data through a different lens: graphs 🤯.

Unfinished.

PAC Learning

Introduction

The human brain is a dopamine-based prediction machine, and learning can be seen as the process of building goal-oriented prediction models. So how do we formalize learning under uncertainty, where we cannot be sure our predictions will remain useful in an evolving world? Leslie Valiant proposed one of the earliest mathematical frameworks for this: PAC Learning.

Formulation

PAC Learning stands for Probably Approximately Correct Learning: with high probability (Probably), a limited, goal-oriented learner predicts (Learning) outcomes not too far off (Approximately Correct) from consistent experience of an objective reality. Let us now formalize the terms in this sentence: (1) objective reality, (2) consistent experience, (3) limited goal-oriented learner, (4) far off, (5) high probability.

(1) Objective reality: We assume an objective reality that exists independently of experience, and not vice versa. We describe reality with two abstractions: a set X (the instance or input space) and a set C (the concept class) of subsets c of X (the concepts). These two abstractions allow arbitrary knowledge representations, and thus the possibility of learning and goal-seeking within them.

Specifically, X is a set of encodings of objects in the learner's world, c is a subset of X whose members positively exemplify some simple or interesting rule over X, and C is a collection of such subsets c, expressing some general knowledge representation. For example, if a learner is solely interested in learning people's heights, then X would be the set of positive real numbers, c would be a rule mapping a group's height range [a, b] in X to 1 and its complement to 0, and C would express the general knowledge of all such height ranges.
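In symbols, one way to encode this height example (a sketch; the choice of unit and of a real-valued encoding is an arbitrary modelling decision) is:

X = \mathbb{R}_{>0}, \quad c_{[a,b]}(x) = \mathbb{1}[x \in [a, b]], \quad C = \{ c_{[a,b]} : 0 < a \leq b \}.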

(2) Consistent experience: This refers to tuples of the form (x_i, c(x_i)), with an unknown target concept c from C and each x_i drawn from X according to a fixed target distribution D. These tuples are generated by a procedure EX(c, D) that runs in unit time. Fixing the distribution D is what makes the experience consistent, and such experience transcends subjectivity, since it does not depend on any particular learner.

(3) Limited goal-oriented learner: Such a learner receives a number of units of consistent experience bounded by its running time. The goal of the learner is to use this experience to mimic c(x \sim D(X)) with its learned prediction h(x \sim D(X)), assuming h \in C. Quantitatively, the learner aims to minimize error(h) = P_{x \sim D(X)}[c(x) \neq h(x)]. In other words, the learner tries to learn some underlying rule of reality by mimicking it, using consistent (but intrinsically uncertain) experience limited by its running time.

(4) Far off: The learner’s predictions are not too far off if error(h) \leq \epsilon.

(5) High probability: The probability that the learner's predictions are not too far off is above a threshold: P(error(h) \leq \epsilon) \geq 1 - \delta. The randomness here comes both from the samples drawn from D and from any randomization inside the learner's algorithm.
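To make (1) through (5) concrete, here is a minimal simulation sketch of the height-interval example from (1). The uniform sampling distribution, the particular target interval, and the tightest-interval learner are all assumptions chosen for illustration, not part of the formal definition:

import numpy as np

np.random.seed(0)

D_LOW, D_HIGH = 140, 210   # D = Uniform(140, 210) cm, an assumed target distribution
C_LOW, C_HIGH = 160, 185   # the unknown target concept c = [a, b], fixed here for the simulation

def EX(m):
    """Draw m labeled examples (x_i, c(x_i)) with x_i ~ D, as EX(c, D) would."""
    x = np.random.uniform(D_LOW, D_HIGH, size=m)
    return x, ((x >= C_LOW) & (x <= C_HIGH)).astype(int)

def learn_interval(x, y):
    """Tightest interval containing all positive examples: a natural hypothesis h in C."""
    pos = x[y == 1]
    return (pos.min(), pos.max()) if len(pos) else (0.0, 0.0)

def err(h, n=100000):
    """Monte Carlo estimate of error(h) = P_{x ~ D}[c(x) != h(x)]."""
    x = np.random.uniform(D_LOW, D_HIGH, size=n)
    return np.mean(((x >= h[0]) & (x <= h[1])) != ((x >= C_LOW) & (x <= C_HIGH)))

eps, delta = 0.05, 0.05
errors = [err(learn_interval(*EX(200))) for _ in range(500)]
print(np.mean(np.array(errors) <= eps))   # empirical P(error(h) <= eps); should be at least 1 - delta

The tightest-interval learner is a natural choice here because its hypothesis never includes points outside the target interval, so it can only err near the two ends of [a, b] it has not yet seen positive examples from.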

This formalization presumes metaphysical predispositions that are, by themselves, reasonable answers to some epistemological questions. Here is the preliminary definition of PAC Learning:

Definition 1 Let C be a concept class over X. We say that C is PAC learnable if there exists an algorithm L with the following property: for every concept c of C, for every distribution D on X, and for all 0<\epsilon<1/2 and 0<\delta<1/2, if L is given access to EX(c, D) and inputs \epsilon and \delta, then with probability at least 1-\delta, L outputs a hypothesis concept h of C satisfying error(h) \leq \epsilon. This probability is taken over the random examples drawn by calls to EX(c, D), and any internal randomization of L.
If L runs in time polynomial in 1/\epsilon and 1/\delta, we say that C is efficiently PAC learnable. We will sometimes refer to the input \epsilon as the error parameter and \delta as the confidence parameter.
— An Introduction to Computational Learning Theory by Michael J. Kearns and Umesh V. Vazirani

A couple of important refinements to this preliminary definition will be made in the next post. Below we introduce some toy problems using the limited definition given in this post.

Example

Not finished.