‘Sonic Games’ Web Scraping Project
Link to the GitHub repository: https://github.com/eelviral/Video-Games-Site-Web-Scraping
Project Overview: Web Scraping Sonic Game Data
In today's digital age, data is of growing importance: it drives decisions, yields insights, and inspires innovation.
In this project, I explored web scraping and data collection. Recognizing the popularity and history of the Sonic the Hedgehog franchise, I set out to extract and organize a structured dataset of Sonic games.
With Python, BeautifulSoup, and other web scraping tools, this project aims to pull game titles, release dates, developers, platforms, and other relevant details from various online sources.
What started as simple curiosity eventually transformed into a rich dataset that offers a fascinating look at the evolution of the Sonic gaming franchise.
Below, I detail the steps I took to complete the project and do my best to delve into the technical aspects along the way.
Step 1: Create our PostgreSQL tables
Before we begin scraping data or populating our database, we need to set up our database schema (in other words, define the way we will store our data).
For this project, I designed two tables using SQL:
The consoles table will hold information about various gaming consoles, identified by a unique c_id.
The videogames table will store details of individual video games, each identified by a unique videogame_id, and will include attributes such as title, developer, publisher, and release date.
By defining these schemas upfront, we ensure that our data has a structured place to reside once we start the data collection process.
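As a rough sketch of what these two schemas might look like in SQL (the column names and types beyond those described above, such as `name` and `release_date`, are illustrative assumptions, not the exact definitions used in the project):

```sql
-- consoles: one row per gaming console, keyed by c_id
CREATE TABLE consoles (
    c_id SERIAL PRIMARY KEY,
    name TEXT NOT NULL          -- assumed column for the console's name
);

-- videogames: one row per game, keyed by videogame_id
CREATE TABLE videogames (
    videogame_id SERIAL PRIMARY KEY,
    title        TEXT NOT NULL,
    developer    INTEGER,       -- stored as a 1/0 flag, per the results section
    publisher    TEXT,
    release_date TEXT           -- assumed type; a DATE column would also work
);
```

Defining the primary keys up front (`c_id`, `videogame_id`) is what lets each scraped record be uniquely identified later.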
Step 2: Gather the video game lists
Here is the main homepage of the Sonic games history website. This is our starting point. Most of the orange links on this page contain a list of video games. For example:
List of arcade games
List of LCD games
List of games on miscellaneous platforms
List of 1990s games
List of 2000s games
etc.
To reach the lists behind these links, we need to filter out the HTML we don’t need so that we can collect only the orange link URLs we do need.
We will use the Beautiful Soup 4 module in Python to do this. After we collect these URLs, we are ready to move on to the next step.
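A minimal sketch of this step with Beautiful Soup 4 is shown below. The base URL and the "List of" text filter are assumptions for illustration; the real homepage may require a narrower selector.

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup

BASE_URL = "https://sonic.fandom.com"  # assumed base URL for relative links


def collect_list_links(html: str, base_url: str = BASE_URL) -> list[str]:
    """Collect the 'List of ...' link URLs from the homepage HTML."""
    soup = BeautifulSoup(html, "html.parser")
    links = []
    for a in soup.find_all("a", href=True):
        # Keep only anchors whose text looks like a game-list link.
        if a.get_text(strip=True).startswith("List of"):
            links.append(urljoin(base_url, a["href"]))
    return links


# Small demonstration on an inline HTML snippet:
sample = """
<ul>
  <li><a href="/wiki/List_of_arcade_games">List of arcade games</a></li>
  <li><a href="/wiki/List_of_LCD_games">List of LCD games</a></li>
  <li><a href="/wiki/About">About this wiki</a></li>
</ul>
"""
print(collect_list_links(sample))
```

Filtering by link text (rather than collecting every `<a>` tag) is what discards the navigation and footer links we don't need.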
Step 3: Gather the video game URLs
If we click on any of the orange links from the main homepage, we’ll get a list of video games like the one we see above (this image shows the list of 1990s Sonic games). These lists already contain more data, but we can still get more information.
Under the ‘Video game’ column, we see a list of video game title links, highlighted in orange. For example:
Sonic the Hedgehog
Sonic the Hedgehog (8-bit)
Sonic Eraser
Waku Waku Sonic Patrol Car
Sonic the Hedgehog 2 (8-bit)
etc.
These links contain more information about each individual video game, which is exactly what we need.
We will use Python to gather these particular links so that we can move on to the next step and finally start scraping data.
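One way to sketch this step, assuming the game titles sit in the first column of an HTML table (the table layout and base URL here are assumptions about the page structure):

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup


def collect_game_links(html: str,
                       base_url: str = "https://sonic.fandom.com") -> list[str]:
    """Collect game-page URLs from the first column of a game-list table."""
    soup = BeautifulSoup(html, "html.parser")
    links = []
    for row in soup.select("table tr"):
        cell = row.find("td")  # first data cell holds the game title link
        if cell is None:
            continue           # skip header rows, which have <th> only
        a = cell.find("a", href=True)
        if a:
            links.append(urljoin(base_url, a["href"]))
    return links


# Demonstration on a simplified version of a game-list table:
sample = """
<table>
  <tr><th>Video game</th><th>Year</th></tr>
  <tr><td><a href="/wiki/Sonic_the_Hedgehog">Sonic the Hedgehog</a></td><td>1991</td></tr>
  <tr><td><a href="/wiki/Sonic_Eraser">Sonic Eraser</a></td><td>1991</td></tr>
</table>
"""
print(collect_game_links(sample))
```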
Step 4: Scrape the video game data
This is one of the many webpages we will scrape valuable video game data from. Specifically, we will be collecting the:
Video Game Title
Developer(s) of the game
Publisher(s) of the game
Release date of the game
Platform(s) that the game exists on
After we collect all this data and strip off the leading whitespace, we will upload it to the neat, organized tables we set up in PostgreSQL.
(source: https://sonic.fandom.com/wiki/Sonic_the_Hedgehog_(1991))
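A sketch of the scraping-and-cleaning step is below. The `infobox` class and the field labels are assumptions for illustration; the real Fandom page markup differs, but the pattern of matching label cells and stripping whitespace with `get_text(strip=True)` is the same idea.

```python
from bs4 import BeautifulSoup

# Fields we want from each game page (assumed labels, for illustration):
FIELDS = {"Developer", "Publisher", "Release date", "Platform"}


def scrape_game(html: str) -> dict:
    """Extract title and labeled infobox fields from a game page."""
    soup = BeautifulSoup(html, "html.parser")
    data = {"title": soup.find("h1").get_text(strip=True)}
    for row in soup.select(".infobox tr"):   # hypothetical infobox layout
        header, value = row.find("th"), row.find("td")
        if header and value and header.get_text(strip=True) in FIELDS:
            # get_text(strip=True) cleans leading/trailing whitespace for us
            data[header.get_text(strip=True)] = value.get_text(strip=True)
    return data


# Demonstration on a simplified game page:
sample = """
<h1>Sonic the Hedgehog</h1>
<table class="infobox">
  <tr><th>Developer</th><td>  Sonic Team </td></tr>
  <tr><th>Publisher</th><td>Sega</td></tr>
  <tr><th>Release date</th><td>June 23, 1991</td></tr>
  <tr><th>Platform</th><td>Sega Mega Drive</td></tr>
</table>
"""
print(scrape_game(sample))
```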
Running the Web Scraping Program
Below, I present a clip of my web scraping program running in Python. As the program scrapes, cleans, and uploads the data for each Sonic game, it also prints out the game data it has “on hand” before uploading it to the database.
The Results
After we run the Python program and scrape the Sonic games websites, here is the data we collected in our database.
This is our videogames table in PostgreSQL, filled with the game title data we scraped.
Here we store each game’s title, whether it has a developer (1 = has a developer, 0 = does not), its publisher, and its release date.
This is our consoles table in PostgreSQL, filled with all the consoles our web scraper encountered while it was running.
Here we see the variety of consoles the Sonic games were produced for, ranging from the Nintendo GameCube to the Sega CD and the PlayStation 2, among others.
To Conclude…
This project allowed us to effectively utilize web scraping techniques to gather valuable data on Sonic games. By cleaning and preprocessing our data, we ensured that our findings were not only accurate but also well-organized within a neat and structured database.
With this work done, our ‘hypothetical data analysts’ now have a well-made dataset at their fingertips. With it, they can draw informed insights, build compelling visuals, and develop a deeper understanding of the trends and patterns within the Sonic games realm.
This journey—from data extraction to its storage—has been both challenging and rewarding. It reaffirms the importance of data in driving decisions and insights in our modern world.