Scraping data off Wikipedia: three ways (no code and code)

Scraping data off Wikipedia: three ways (no code and code)

Get your hands on excellent manually annotated datasets with Google Sheets or Python

I needed some data for my side project. Google helped me find only outdated CSVs, expensive closed sources and APIs hidden behind “contact sales” button.

Wikipedia was ranking high in my searches but I quickly found that the tables are as unstructured as the ones you typically create in Word or Google Docs. There was no consistency in columns and naming conventions. This at first discouraged me but after failing to find another datasource, I gave it a go. And I ended up using it for my project.

Let me show you how to quickly load Wikipedia data for your data analysis. I was astonished how easy it is.

Big thanks to all Wikipedia contributors doing the hard work.

Method 1: Loading tables in Google Sheets

This one feels like magic.

Use this formula in one cell, and it will expand the entire table scraped from Wikipedia.

=importHTML(WIKIPEDIA_URL, "table", NUMBER_OF_TABLE_ON_THE_PAGE);

Let’s say you want to fetch a list of all EU countries with their population count, GDP, language and other essential data.

There’s a Wikipedia page containing all of this information and citing sources of the information(!): https://en.wikipedia.org/wiki/Member_state_of_the_European_Union.

The table containing this info is the second table on the page (the first one is the little summary table). The formula will look like:

=importHTML("https://en.wikipedia.org/wiki/Member_state_of_the_European_Union", "table", 2);

After running it, you’ll get all of this data in your Google Sheets, ready to analyze or export.

Google Sheets can also read lists! Full docs here.

Method 2: Pandas and Python

I was surprised to find out that the most popular library for data analysis for Python has a feature equivalent to Google Docs.

import pandas as pd

url = "https://en.wikipedia.org/wiki/Member_state_of_the_European_Union"

dfs = pd.read_html(url)

df = dfs[1]

read_html function from pandas takes site URL as an input and fetches all tables to a list of dataframes. Again, the second one is what we are looking for. And it’s ready for further analysis!

Since Google Docs was not something I intended to stick with, I used this method in addition to method 3 to gather data from several pages.

Full documentation here.

Method 3: Python and Beautiful Soup

Well, Beautiful Soup is a go to library for parsing HTML in Python ecosystem. I looked for easy solutions but I needed to fall back on it in the end. The reason was I needed to get not only data from tables but also from headings that preceded them.

Instead of writing the code from scratch, I recommend using ChatGPT and refining its output. I basically pasted a url, said what I needed to scrape and got the script without giving any guidance.

If you wanted to do something similar to me - getting all tables from a page and data from multiple levels of headings preceding them - you would do something like this.

import requests
from bs4 import BeautifulSoup

response = requests.get(url)
soup = BeautifulSoup(response.content, 'html.parser')
tables = soup.find_all('table', {'class': 'wikitable'})

all_airports = []

for table in tables:
  continent = table.find_previous('h2').find('span', class_='mw-headline').text
  region = table.find_previous('h3').find('span', class_='mw-headline').text
  country = table.find_previous('h4').find('span', class_='mw-headline').text

  rows = table.find_all('tr')

  for row in rows[1:]:
      cols = row.find_all('td')

      # here push the data wherever you need

It looks pretty neat, doesn’t it? I was worried that scraping “by hand” would be a pain but actually Wikipedia usually has a pretty clear, predictable HTML structure. It varies page to page but adheres to the same patterns.

This method in addition to using pandas helped me get all the data I needed.

Summary

As you can see, getting data for your analyses or side projects can be nice and easy.

There is a catch though - there are no data consistency mechanisms in tables inside Wikipedia pages. This means that after using any of the above methods, you’ll likely need to do some cleaning and check for mismatches in naming conventions. Some regular expressions will come in handy as well.

At times I am cynical about technological progress but discovering that information access can be so straightforward makes me feel that I can really stand on the shoulders of giants when building on my own.

If you liked this content, please support me by sharing this post and subscribing to my Newsletter. I will soon be releasing a crazy interesting project I needed data from Wikipedia for!

You can also find me here:

Did you find this article valuable?

Support Karol Horosin: AI, Engineering & Product by becoming a sponsor. Any amount is appreciated!