Author(s): Milan Janosov
TL;DR: This article explores the popularity of 2024 Paris Olympic sports and athletes using Wikipedia data, Python, and visualization techniques. It highlights trends in public interest during the Olympics and provides insights that can be applied across various fields.
Disclaimer: This post has been created automatically using generative AI, including DALL-E, Gemini, OpenAI and others. Please take its contents with a grain of salt. For feedback on how we can improve, please email us
The 2024 Paris Olympic Games have captured the attention of millions worldwide, with fans eagerly following their favorite sports and athletes. As a data scientist, I set out to quantify this excitement by analyzing Wikipedia data to visualize the popularity of top athletes and Olympic sports. In this article, I share my approach, code, and findings.
Data Collection from Wikipedia
To start, I gathered data from Wikipedia, focusing on the profiles and view counts of Olympic sports and athletes. Using Python’s requests and BeautifulSoup libraries, I scraped the Wikipedia page for the 2024 Summer Olympics to extract a list of sports and their respective Wikipedia URLs.
Scraping Wikipedia for Olympic Sports
import requests
from bs4 import BeautifulSoup
import re

url = 'https://en.wikipedia.org/wiki/2024_Summer_Olympics'
response = requests.get(url)
html_content = response.content
soup = BeautifulSoup(html_content, 'html.parser')

# Collect every link from 'Artistic swimming' through 'Wrestling',
# the two entries this scrape uses to bracket the sports list on the page.
is_sport = False
sports_urls = {}
for res in soup.find_all('a', href=True):
    res_text = str(res)
    if 'Artistic swimming' in res_text:
        is_sport = True
    if is_sport:
        # hrefs already start with '/wiki/', so the base has no trailing slash
        sports_urls[res.text] = 'https://en.wikipedia.org' + res['href']
    if 'Wrestling' in res_text:
        break
This code extracts the names and URLs of all Summer Olympic sports. Next, I analyzed the popularity of these sports by tracking the view counts on their Wikipedia pages.
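One fragile step in scraping like this is turning relative hrefs into full URLs. As a minimal sketch, assuming article hrefs look like `/wiki/Some_page`, the standard library's `urljoin` takes care of the slashes, and a small filter drops links that are not articles. The helper name `to_article_url` and the sample hrefs are illustrative and not part of the original analysis.

```python
from urllib.parse import urljoin

BASE = 'https://en.wikipedia.org/'

def to_article_url(href):
    """Return an absolute article URL, or None for non-article links."""
    # Keep only internal article links; skip anchors, external links,
    # and namespaced pages such as File: or Help:.
    if not href.startswith('/wiki/') or ':' in href:
        return None
    return urljoin(BASE, href)

# Illustrative hrefs of the kinds a Wikipedia page typically contains.
sample_hrefs = [
    '/wiki/Archery_at_the_2024_Summer_Olympics',
    '#cite_note-1',
    '/wiki/File:Olympic_rings.svg',
    'https://olympics.com/',
]
article_urls = [u for u in map(to_article_url, sample_hrefs) if u]
print(article_urls)
```

A filter like this could replace the text-based start and stop markers only if the sports links can be told apart by their href pattern, which would need checking against the live page.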
Assessing the Popularity of Olympic Sports
To evaluate the popularity of different sports, I used the mwviews library to gather daily Wikipedia page views from June 11 to August 11, 2024, a two-month window that starts about six weeks before the Games and runs through their final day. This provided a comprehensive dataset for analysis.
Retrieving Wikipedia Page View Counts
from mwviews.api import PageviewsClient
import pandas as pd

p = PageviewsClient(user_agent="[email protected]> Sport analysis")
domain = 'en'
sports_data = {}   # sport -> DataFrame of daily views
sports_count = {}  # sport -> total views over the window
for sport, url in sports_urls.items():
    # The page title is the part of the URL after 'wiki/'
    page = url.split('wiki/')[-1]
    data = []
    for a, b in p.article_views(domain + '.wikipedia', [page], granularity='daily', start='20240611', end='20240811').items():
        data.append({'date': a, 'count': b[page]})
    df = pd.DataFrame(data)
    sports_data[sport] = df
    sports_count[sport] = sum(df['count'])
This code allows us to download and sum up the daily view counts for each sport’s Wikipedia page, giving us a clear measure of each sport’s popularity.
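The summing step can be made a little more defensive. The sketch below assumes the page view response is a mapping from date to a `{page: views}` dictionary, where `views` may be `None` on days with no reported data; that shape is an assumption about the client's output, and `total_views` and the sample values are hypothetical.

```python
from datetime import date

def total_views(daily, page):
    """Sum daily view counts for one page, treating missing days as zero."""
    return sum((views.get(page) or 0) for views in daily.values())

# Hypothetical response in the assumed shape: date -> {page: views or None}.
sample = {
    date(2024, 7, 26): {'Fencing': 1200},
    date(2024, 7, 27): {'Fencing': None},  # no data reported for this day
    date(2024, 7, 28): {'Fencing': 3400},
}
print(total_views(sample, 'Fencing'))
```

Treating gaps as zero keeps one missing day from breaking the total, at the cost of slightly understating the true count.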
Visualizing Overall Sports Popularity
import matplotlib.pyplot as plt
import numpy as np
sorted_sports_data = dict(sorted(sports_count.items(), key=lambda item: item[1], reverse=True))
sports = list(sorted_sports_data.keys())
values = list(sorted_sports_data.values())
fig, ax = plt.subplots(figsize=(10, 8))
colors = plt.cm.Set1(np.linspace(0, 1, len(sports)))
bars = ax.barh(sports, values, color=colors)
ax.set_xlabel('Total Wikipedia page views')
ax.set_title('Olympic Sports by Wikipedia Page Views')
ax.invert_yaxis()
plt.show()
This bar chart provides a clear comparison of the popularity of different Olympic sports based on Wikipedia views.
Tracking the Popularity of Sports Over Time
Beyond overall popularity, it’s also fascinating to observe how interest in each sport fluctuates over time. By plotting the daily view counts, we can visualize these trends.
Time Series Visualization of Sports Popularity
import matplotlib.pyplot as plt
import seaborn as sns

sns.set(style='whitegrid')
f, ax = plt.subplots(1, 1, figsize=(12, 8))
olympic_colors = sns.color_palette("Set3", n_colors=len(sports_data))
# One line per sport, showing its daily page views
for (sport, data), color in zip(sports_data.items(), olympic_colors):
    ax.plot(data['date'], data['count'], label=sport, color=color)
ax.legend(loc='center left', bbox_to_anchor=(1, 0.5), ncol=2)
ax.set_title('Wikipedia Page Views of Olympic Sports Over Time', fontsize=16)
ax.set_xlabel('Date', fontsize=14)
ax.set_ylabel('Daily page views', fontsize=14)
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()
This time series plot reveals interesting patterns, such as the rapid spike in popularity at the start of each sport’s events, followed by a decline as the competitions concluded.
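To put a number on those spikes, one option is to find the day on which each sport's page views peaked. The sketch below assumes each series is a list of `{'date', 'count'}` records like the ones built during data collection; `peak_day`, the sport names, and the values are placeholders rather than real results.

```python
def peak_day(records):
    """Return (date, count) for the day with the most views."""
    best = max(records, key=lambda r: r['count'])
    return best['date'], best['count']

# Placeholder series; names and numbers are illustrative only.
sample_series = {
    'Sport A': [
        {'date': '2024-07-27', 'count': 150},
        {'date': '2024-07-28', 'count': 900},
        {'date': '2024-07-29', 'count': 400},
    ],
    'Sport B': [
        {'date': '2024-08-03', 'count': 700},
        {'date': '2024-08-04', 'count': 300},
    ],
}
peaks = {sport: peak_day(records) for sport, records in sample_series.items()}
for sport, (day, count) in peaks.items():
    print(sport, day, count)
```

Comparing each peak date with the sport's competition schedule would show whether attention peaks at the start of the events or closer to the finals.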
Analyzing Medal Winners
Next, I turned my attention to the athletes who won medals in the 2024 Olympics. By scraping Wikipedia again, I collected data on gold, silver, and bronze medalists.
Scraping Medal Winners’ Data
medal_url = 'https://en.wikipedia.org/wiki/List_of_2024_Summer_Olympics_medal_winners'
response = requests.get(medal_url)
html_content = response.content
soup = BeautifulSoup(html_content, 'html.parser')

def get_url(text):
    # Return the links in a table cell, skipping links that mention '2024'
    soup_text = BeautifulSoup(str(text), 'html.parser')
    athlete_links = soup_text.find_all('a', href=True)
    athlete_links = [a for a in athlete_links if '2024' not in str(a)]
    return athlete_links

def contains_numbers(string):
    return bool(re.search(r'\d', string))

def add_medalists(medal_list, medal_html):
    # Append a (name, url) pair for every link kept from the cell
    for athlete_link in get_url(medal_html):
        medal_list.append((athlete_link.text, 'https://en.wikipedia.org' + athlete_link['href']))

tables = soup.find_all('table', class_='wikitable')
golds = []
silvers = []
bronzes = []
for table in tables:
    rows = table.find_all('tr')
    for row in rows:
        cells = row.find_all('td')
        # Medal rows have four cells: event, gold, silver, bronze
        if len(cells) == 4:
            event, gold, silver, bronze = cells
            add_medalists(golds, gold)
            add_medalists(silvers, silver)
            add_medalists(bronzes, bronze)
This code gathers the names and Wikipedia URLs of all athletes who won medals, enabling further analysis of their popularity.
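Because the same athlete can appear in several events, it may help to merge the three lists into one tally per athlete. This is a minimal sketch assuming each list holds `(name, url)` pairs as described above; it keys on the URL because names alone may collide. The function `tally_medals` and the sample entries are hypothetical.

```python
from collections import defaultdict

def tally_medals(golds, silvers, bronzes):
    """Count medals per athlete URL across the three medal lists."""
    tally = defaultdict(lambda: {'name': None, 'gold': 0, 'silver': 0, 'bronze': 0})
    for medal, entries in (('gold', golds), ('silver', silvers), ('bronze', bronzes)):
        for name, url in entries:
            tally[url]['name'] = name
            tally[url][medal] += 1
    return dict(tally)

# Placeholder entries; not real medal results.
sample_golds = [('Athlete A', 'https://en.wikipedia.org/wiki/Athlete_A'),
                ('Athlete A', 'https://en.wikipedia.org/wiki/Athlete_A')]
sample_silvers = [('Athlete B', 'https://en.wikipedia.org/wiki/Athlete_B')]
sample_bronzes = [('Athlete A', 'https://en.wikipedia.org/wiki/Athlete_A')]

medal_tally = tally_medals(sample_golds, sample_silvers, sample_bronzes)
for url, row in medal_tally.items():
    print(row['name'], row['gold'], row['silver'], row['bronze'])
```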
Medal Winners’ Popularity Analysis
Using the same approach as for sports, I tracked the Wikipedia view counts of medal-winning athletes to determine who captured the public’s attention during the games.
athletes_links = {}
for athlete, link in golds: athletes_links[athlete] = link
for athlete, link in silvers: athletes_links[athlete] = link
for athlete, link in bronzes: athletes_links[athlete] = link

athletes_data = {}   # athlete -> DataFrame of daily views
athletes_count = {}  # athlete -> total views over the window
for idx, (athlete, url) in enumerate(athletes_links.items()):
    if idx % 100 == 0:
        print(idx)  # progress indicator
    try:
        page = url.split('wiki/')[-1]
        data = []
        for a, b in p.article_views(domain + '.wikipedia', [page], granularity='daily', start='20240611', end='20240811').items():
            data.append({'date': a, 'count': b[page]})
        df = pd.DataFrame(data)
        athletes_data[athlete] = df
        athletes_count[athlete] = sum(df['count'])
    except Exception:
        # Skip athletes whose page views cannot be retrieved
        pass
print('Number of medal-winning athletes with measurable Wiki popularity: ', len(athletes_data))
With the view totals collected by this code, I could rank the medalists and identify the top 20 most popular athletes based on Wikipedia views, highlighting who truly became the stars of the 2024 Paris Olympics.
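The ranking step itself is not shown in the article. A minimal sketch, assuming a dictionary that maps each athlete to a total view count, could look like the following; `top_n`, `view_counts`, and the values are placeholders, not real figures.

```python
def top_n(counts, n=20):
    """Return the n (name, total) pairs with the highest totals, largest first."""
    return sorted(counts.items(), key=lambda item: item[1], reverse=True)[:n]

# Placeholder totals; not real view counts.
view_counts = {
    'Athlete A': 52000,
    'Athlete B': 180000,
    'Athlete C': 9700,
    'Athlete D': 74000,
}
for rank, (name, total) in enumerate(top_n(view_counts, n=3), start=1):
    print(rank, name, total)
```

Sorting the full dictionary is simple and fast enough for a few thousand athletes; `heapq.nlargest` would be an alternative for much larger collections.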
Conclusion
This analysis demonstrates the power of data-driven insights in understanding the popularity of Olympic sports and athletes. By leveraging Wikipedia data and Python, we’ve uncovered which sports and athletes captured the most attention during the 2024 Paris Olympics. These methods aren’t just limited to sports; they can be applied to any domain where public interest and trends need to be analyzed, providing valuable insights for researchers, marketers, and fans alike.
Crafted using generative AI from insights found on Towards Data Science.