
A Data-Driven Exploration of the Stars of the 2024 Paris Olympics

Author(s): Milan Janosov

TL;DR: This article explores the popularity of 2024 Paris Olympic sports and athletes using Wikipedia data, Python, and visualization techniques. It highlights trends in public interest during the Olympics and provides insights that can be applied across various fields.

Disclaimer: This post was created automatically using generative AI, including DALL-E, Gemini, OpenAI, and others. Please take its contents with a grain of salt. For feedback on how we can improve, please email us.

The 2024 Paris Olympic Games have captured the attention of millions worldwide, with fans eagerly following their favorite sports and athletes. As a data scientist, I set out to quantify this excitement by analyzing Wikipedia data to visualize the popularity of top athletes and Olympic sports. In this article, I share my approach, code, and findings.

Data Collection from Wikipedia

To start, I gathered data from Wikipedia, focusing on the profiles and view counts of Olympic sports and athletes. Using Python’s requests and BeautifulSoup libraries, I scraped the Wikipedia page for the 2024 Summer Olympics to extract a list of sports and their respective Wikipedia URLs.

Scraping Wikipedia for Olympic Sports


import requests
from bs4 import BeautifulSoup
import re

url = 'https://en.wikipedia.org/wiki/2024_Summer_Olympics'
response = requests.get(url)
html_content = response.content
soup = BeautifulSoup(html_content, 'html.parser')

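is_sport = False
sports_urls = {}

# Collect every link between the first sport ('Artistic swimming')
# and the last one ('Wrestling') in the page's list of sports
for res in soup.find_all('a', href=True):
    res_text = str(res)
    if 'Artistic swimming' in res_text:
        is_sport = True

    if is_sport:
        # hrefs are site-relative ('/wiki/...'), so prepend the domain only
        sports_urls[res.text] = 'https://en.wikipedia.org' + res['href']

    if 'Wrestling' in res_text:
        break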

This code extracts the names and URLs of all Summer Olympic sports by collecting every link between the first and the last sport listed on the page. Next, I analyzed the popularity of these sports by tracking the view counts on their Wikipedia pages.
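The links scraped from Wikipedia are site-relative (they start with /wiki/), so they need to be joined with the site root before use. The following is a minimal sketch of that step using only the standard library; the helper names to_absolute_url and page_title are hypothetical and not part of the original code.

```python
from urllib.parse import urljoin, urlparse, unquote

WIKI_ROOT = 'https://en.wikipedia.org'  # assumed base URL for English Wikipedia

def to_absolute_url(href, root=WIKI_ROOT):
    """Join a site-relative href such as '/wiki/Judo' with the site root."""
    return urljoin(root, href)

def page_title(url):
    """Return the decoded article title, i.e. the path part after '/wiki/'."""
    path = urlparse(url).path
    return unquote(path.split('/wiki/', 1)[-1])

url = to_absolute_url('/wiki/Artistic_swimming')
print(url)              # https://en.wikipedia.org/wiki/Artistic_swimming
print(page_title(url))  # Artistic_swimming
```

Using urljoin avoids accidental double slashes when the base and the href are concatenated by hand.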

Assessing the Popularity of Olympic Sports

To evaluate the popularity of different sports, I used the mwviews library to gather daily Wikipedia page views from June 11 to August 11, 2024, a two-month window that covers roughly six weeks before the Games as well as the Games themselves. This provided a comprehensive dataset for analysis.

Retrieving Wikipedia Page View Counts


from mwviews.api import PageviewsClient
import pandas as pd

# Placeholder contact address: replace it with your own email
p = PageviewsClient(user_agent="<your-email@example.com> Sport analysis")
domain = 'en'

sports_data = {}
sports_count = {}

for sport, url in sports_urls.items():
    # The page title is the part of the URL after 'wiki/'
    page = url.split('wiki/')[-1]
    data = []
    # article_views returns one entry per day, keyed by date
    for a, b in p.article_views(domain + '.wikipedia', [page], granularity='daily', start='20240611', end='20240811').items():
        data.append({'date': a, 'count': b[page]})

    df = pd.DataFrame(data)
    sports_data[sport] = df                 # daily series
    sports_count[sport] = sum(df['count'])  # total views over the period

This code allows us to download and sum up the daily view counts for each sport’s Wikipedia page, giving us a clear measure of each sport’s popularity.
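One detail worth guarding against when summing is a day with no recorded views. The sketch below assumes the response has the shape {date: {page: count}} used above and that a missing day may appear as None (an assumption about the API's behavior, not something confirmed here); total_views is a hypothetical helper.

```python
from datetime import datetime

def total_views(raw_views, page):
    """Sum daily counts for one page, treating missing (None) days as zero."""
    return sum((counts.get(page) or 0) for counts in raw_views.values())

# Hypothetical sample in the assumed shape: {date: {page: count}}
sample = {
    datetime(2024, 7, 26): {'Judo': 1200},
    datetime(2024, 7, 27): {'Judo': None},   # assumed: no data for this day
    datetime(2024, 7, 28): {'Judo': 3400},
}

print(total_views(sample, 'Judo'))  # 4600
```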

Visualizing Overall Sports Popularity


import matplotlib.pyplot as plt
import numpy as np

sorted_sports_data = dict(sorted(sports_count.items(), key=lambda item: item[1], reverse=True))

sports = list(sorted_sports_data.keys())
values = list(sorted_sports_data.values())

fig, ax = plt.subplots(figsize=(10, 8))
colors = plt.cm.Set1(np.linspace(0, 1, len(sports)))

bars = ax.barh(sports, values, color=colors)
ax.set_xlabel('Total Wikipedia page views')
ax.set_title('Popularity of Olympic Sports by Wikipedia Page Views')

ax.invert_yaxis()
plt.show()

This bar chart provides a clear comparison of the popularity of different Olympic sports based on Wikipedia views.

Tracking the Popularity of Sports Over Time

Beyond overall popularity, it’s also fascinating to observe how interest in each sport fluctuates over time. By plotting the daily view counts, we can visualize these trends.

Time Series Visualization of Sports Popularity


import matplotlib.pyplot as plt
import seaborn as sns

sns.set(style='whitegrid')
f, ax = plt.subplots(1, 1, figsize=(12, 8))
olympic_colors = sns.color_palette("Set3", n_colors=len(sports_data))

for (sport, data), color in zip(sports_data.items(), olympic_colors):
    ax.plot(data['date'], data['count'], label=sport, color=color)

ax.legend(loc='center left', bbox_to_anchor=(1, 0.5), ncol=2)
ax.set_title('Wikipedia Page Views of Olympic Sports Over Time', fontsize=16)
ax.set_xlabel('Date', fontsize=14)
ax.set_ylabel('Daily page views', fontsize=14)
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()

This time series plot reveals interesting patterns, such as the rapid spike in popularity at the start of each sport’s events, followed by a decline as the competitions concluded.
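To put a date on those spikes, one option is to look up the day with the highest view count for each sport. This is a minimal sketch with made-up numbers; peak_day is a hypothetical helper, and the sample values are illustrative rather than real Wikipedia data.

```python
from datetime import date

def peak_day(daily_counts):
    """Return the (date, count) pair with the highest view count."""
    return max(daily_counts, key=lambda pair: pair[1])

# Hypothetical daily counts, not real Wikipedia data
sample_series = {
    'Judo': [(date(2024, 7, 26), 900), (date(2024, 7, 27), 5200), (date(2024, 7, 28), 4100)],
    'Wrestling': [(date(2024, 8, 4), 700), (date(2024, 8, 5), 3100), (date(2024, 8, 6), 3600)],
}

peaks = {sport: peak_day(series) for sport, series in sample_series.items()}

# List sports in the order in which their peaks occurred
for sport, (day, count) in sorted(peaks.items(), key=lambda kv: kv[1][0]):
    print(sport, day.isoformat(), count)
```

With the per-sport DataFrames from earlier, the same helper could be fed list(zip(df['date'], df['count'])).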

Analyzing Medal Winners

Next, I turned my attention to the athletes who won medals in the 2024 Olympics. By scraping Wikipedia again, I collected data on gold, silver, and bronze medalists.

Scraping Medal Winners’ Data


medal_url = 'https://en.wikipedia.org/wiki/List_of_2024_Summer_Olympics_medal_winners'

response = requests.get(medal_url)
html_content = response.content
soup = BeautifulSoup(html_content, 'html.parser')

def get_url(cell):
    # Keep links to athletes; skip any link mentioning '2024'
    # (for example, a country's page for these Games)
    return [a for a in cell.find_all('a', href=True) if '2024' not in str(a)]

def add_medalists(medal_list, medal_html):
    # hrefs are site-relative ('/wiki/...'), so prepend the domain only
    for athlete_link in get_url(medal_html):
        medal_list.append((athlete_link.text, 'https://en.wikipedia.org' + athlete_link['href']))

tables = soup.find_all('table', class_='wikitable')

golds = []
silvers = []
bronzes = []

for idx, table in enumerate(tables):
    rows = table.find_all('tr')
    for row in rows:
        cells = row.find_all('td')
        if len(cells) == 4:
            event, gold, silver, bronze = cells
            add_medalists(golds, gold)
            add_medalists(silvers, silver)
            add_medalists(bronzes, bronze)

This code gathers the names and Wikipedia URLs of all athletes who won medals, enabling further analysis of their popularity.
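Since the same athlete can appear in more than one list, it can help to tally medals per athlete before moving on. The sketch below assumes the (name, url) tuple format produced above; medal_tally and the sample names are hypothetical.

```python
from collections import Counter

def medal_tally(golds, silvers, bronzes):
    """Count medals per athlete from lists of (name, url) tuples."""
    tally = {}
    for medal, winners in (('gold', golds), ('silver', silvers), ('bronze', bronzes)):
        for name, _url in winners:
            tally.setdefault(name, Counter())[medal] += 1
    return tally

# Hypothetical entries in the (name, url) format, not real results
sample_golds = [('Athlete A', 'https://en.wikipedia.org/wiki/Athlete_A'),
                ('Athlete A', 'https://en.wikipedia.org/wiki/Athlete_A')]
sample_silvers = [('Athlete B', 'https://en.wikipedia.org/wiki/Athlete_B')]
sample_bronzes = [('Athlete A', 'https://en.wikipedia.org/wiki/Athlete_A')]

tally = medal_tally(sample_golds, sample_silvers, sample_bronzes)
print(dict(tally['Athlete A']))  # {'gold': 2, 'bronze': 1}
```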

Medal Winners’ Popularity Analysis

Using the same approach as for sports, I tracked the Wikipedia view counts of medal-winning athletes to determine who captured the public’s attention during the Games.


athletes_links = {}
for athlete, link in golds: athletes_links[athlete] = link
for athlete, link in silvers: athletes_links[athlete] = link
for athlete, link in bronzes: athletes_links[athlete] = link

athletes_data = {}
athletes_count = {}

for idx, (athlete, url) in enumerate(athletes_links.items()):
    # Print progress every 100 athletes
    if idx % 100 == 0:
        print(idx)

    try:
        page = url.split('wiki/')[-1]
        data = []
        for a, b in p.article_views(domain + '.wikipedia', [page], granularity='daily', start='20240611', end='20240811').items():
            data.append({'date': a, 'count': b[page]})

        df = pd.DataFrame(data)
        # Store the daily series and the total under the athlete's name
        athletes_data[athlete] = df
        athletes_count[athlete] = sum(df['count'])
    except Exception:
        # Skip athletes whose page views cannot be retrieved
        pass

print('Number of medal-winning athletes with measurable Wiki popularity: ', len(athletes_data))

With these totals, I could rank the athletes and identify the 20 most popular ones by Wikipedia views, highlighting who truly became the stars of the 2024 Paris Olympics.
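The ranking step itself is not shown above. A minimal sketch of how a top-N list could be derived from a dictionary of total views follows; top_n is a hypothetical helper, and the sample totals are invented for illustration.

```python
def top_n(counts, n=20):
    """Return the n (name, total_views) pairs with the highest totals."""
    return sorted(counts.items(), key=lambda item: item[1], reverse=True)[:n]

# Hypothetical totals, not real Wikipedia data
sample_counts = {'Athlete A': 120_000, 'Athlete B': 950_000, 'Athlete C': 430_000}

for rank, (name, views) in enumerate(top_n(sample_counts, n=2), start=1):
    print(rank, name, views)
# 1 Athlete B 950000
# 2 Athlete C 430000
```

Applied to the per-athlete totals, the default n=20 would yield the top 20 list.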

Conclusion

This analysis demonstrates the power of data-driven insights in understanding the popularity of Olympic sports and athletes. By leveraging Wikipedia data and Python, we’ve uncovered which sports and athletes captured the most attention during the 2024 Paris Olympics. These methods aren’t just limited to sports; they can be applied to any domain where public interest and trends need to be analyzed, providing valuable insights for researchers, marketers, and fans alike.

Crafted using generative AI from insights found on Towards Data Science.
