Data studies can be difficult to execute if you aren't a data scientist or developer, or don't have hours to spend manually collecting data.
I am always on the lookout for a unique angle on freely available or potentially scrapable data sources. And I have, admittedly, spent hours upon hours trying to learn Python to write simple web scraper applications, only to discover the data isn't accessible, interesting, or differentiated enough from what's already out there.
Thankfully, there is an easy way to collect data from the web without learning a programming language: Google Sheets.
In this video and article, I will explain the simple function I use most often to quickly pull web data into Google Sheets, walk through an example use case, and cover how to avoid common pitfalls.
How to Scrape Data Using ImportXML
There are import functions for a number of structured data sources, including HTML, CSV, and more. We will focus on ImportXML because it can import structured data from multiple source types (XML, HTML, CSV, TSV, and RSS/Atom feeds) all in one function, unlike ImportHTML, for example, which only imports HTML tables and lists.
In Google Sheets, the formula will be formatted with two arguments, a URL and an XPath query:

=IMPORTXML("url", "xpath_query")
The XPath can be copied from the HTML source code when inspecting a page with Chrome's dev tools. Let's dig into an example to learn how to use this function. But first, we will introduce some basic terminology that will come in handy.
Brief Introduction to HTML, XML, and XPath
To save you from actually learning any programming languages (as I advised against earlier), let’s briefly introduce HTML, XML, and XPath.
HTML and XML are similar in that both are markup languages; in fact, they stand for HyperText Markup Language and eXtensible Markup Language, respectively. The key difference is that XML exists simply to store and transport data; it doesn't manipulate, display, or otherwise interact with the data. HTML, on the other hand, is used to display data (like the words you are currently reading).
XPath stands for XML Path Language and is used to navigate an XML document. Basically, a site's data is displayed using HTML, that markup is structured like an XML document, and we use XPath syntax to query, or ask for, the data we want.
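To make the idea concrete, here is a minimal sketch of an XPath-style query using Python's standard library. The HTML snippet is made up for illustration (real Craigslist markup differs), and note that Python's ElementTree supports only a limited subset of XPath, while Sheets' ImportXML accepts fuller XPath expressions.

```python
# A minimal sketch of XPath-style querying, using made-up listing markup.
import xml.etree.ElementTree as ET

snippet = """
<ul class="rows">
  <li class="result"><span class="price">$8,995</span></li>
  <li class="result"><span class="price">$12,500</span></li>
</ul>
"""

root = ET.fromstring(snippet)
# This query asks for every <span class="price"> anywhere under the root.
prices = [span.text for span in root.findall('.//span[@class="price"]')]
print(prices)  # ['$8,995', '$12,500']
```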
Case Study: Scraping Craigslist Data
Let's say you are doing a study of the average price of a car or truck on Craigslist, compared across two markets. Keep in mind that, used correctly, this strategy could scrape data from nearly any site.
In this example, we will use the Google Sheets function "=IMPORTXML", which, as we know, takes two arguments: the URL and the XPath. You can find a template of the Google Sheet used in this case study here.
Open Google Chrome and navigate to the Craigslist results page for “cars & trucks” in the desired city (no other filters).
Right-click between the first two results > "Inspect" to open Chrome's dev tools. In dev tools, right-click the highlighted element (<ul class="rows"> should be highlighted in this example) > Copy > Copy XPath.
Note: Clicking between the first two results will surface the XPath we need on any Craigslist results page; however, other sites may have differing HTML structures that require manually testing different XPath elements to find the right source.
In Google Sheets, paste the copied value of the “XPath” into cell B2 (replace “Add XPath” in our template).
Note: Back in Chrome, notice the Craigslist results are paginated, showing listings 1-120 on page 1, with 3000 total results (total results may differ). We can use this information to conclude that there are 25 pages (3000/120). On page 2, the URL includes an additional query parameter (?s=120). In cell D1 of your Google Sheet, input the value "120", and continue to increment each column by 120 up through 2880 for pages 2-25 (E1 = 240, F1 = 360, …, AA1 = 2880).
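The pagination arithmetic in the note above can be sketched in Python. The base URL below is a hypothetical placeholder, not a real Craigslist path; the point is how the 120-listing page size turns into the offsets entered in row 1 (offset 0 being page 1 itself).

```python
# Sketch of the pagination math: 3000 results at 120 per page = 25 pages,
# each addressed by an "?s=" offset in the URL.
base_url = "https://example.craigslist.org/search/cta?s="  # hypothetical

total_results = 3000
per_page = 120
pages = total_results // per_page  # 25

# Offsets 0, 120, 240, ..., 2880 -- the values entered across row 1.
offsets = [n * per_page for n in range(pages)]
urls = [f"{base_url}{offset}" for offset in offsets]

print(pages)        # 25
print(offsets[:3])  # [0, 120, 240]
print(urls[-1])     # https://example.craigslist.org/search/cta?s=2880
```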
Paste the URL (with "?s=", without "120") into cell A2 (replace "Add URL" in our template). In D2, we will use the Concatenate function to combine the URL in A2 with the value in D1, and do the same for each column value (240, 360, etc.) in row 1 (we will use the "$" to keep the URL reference constant). Cell D2 should contain:

=CONCATENATE($A$2, D1)
In C3, use the ImportXML function to import Craigslist data using the URL in A2 and the XPath in B2 as arguments (=IMPORTXML(A2, B2)). After a second, you should see the data from Craigslist populate row 3.
Note: We want the data to import vertically, so we will transpose it. Cell C3 should contain:

=TRANSPOSE(IMPORTXML(A2, B2))
We will want the XPath to remain constant and the URL to change as we collect results for all 25 pages (~3000 listings). To do this we will again use the "$" to lock the XPath reference while the URL reference shifts to each page's URL in row 2, so cell C3 will contain a formula along the lines of:

=TRANSPOSE(IMPORTXML(C2, $B$2))
You should now be able to drag this formula across the entire row and end up with ~3000 cells of data!
In case you were wondering, the average asking price for used vehicles in Billings, Montana is $21,951. In San Diego, the average asking price for used vehicles is $13,671.
How to Use Original Data in Content Marketing
Whether you are creating content in a boring industry or for the "hottest goss" in pop culture, data and data visualization can level up any content marketing strategy.
When searching for a unique angle for a piece of content, our job as content marketers is to tell a cohesive story that connects the dots between the data and the conclusion. Some examples of common uses for data are:
- Heat maps using state, country, or other location-specific data (example: Who Needs ETIAS from ETIAS)
- Statistics posts (example: 110 Must-Know Cybersecurity Statistics for 2020 from Varonis)
- Generators (example: Halloween Party Ideas from Shutterfly)
- Sortable table (example: Unicorn Startups from Embroker)
- Infographics (example: Do Looks Matter in Making Money? from Fundera)
Tips and Tricks to Clean Data in Google Sheets
It's great that you now have thousands of cells of data, but that data doesn't always import as cleanly as we would like.
There are a few things you can do to quickly organize and clean up the data you just imported.
Copy and Paste
If the data you've imported is dynamic, meaning it will update at regular intervals or after a user-generated upload, the import will refresh every hour. This means that unless you copy and paste the values, the data will likely change on you.
If you are satisfied with the data being imported, select all cells containing data and use "[Command or CTRL] + C" to copy the data to the clipboard. In a clean sheet, use "[Command or CTRL] + Shift + V" to paste the values only, without formulas or formatting.
If the data is imported horizontally in rows, but you need the data to be in vertical columns for your use case (or vice versa), you can use the =TRANSPOSE() function.
In the case of the ImportXML function, or any import function for that matter, you can simply wrap your existing formula in TRANSPOSE().
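For intuition, here is what a transpose does to a small, made-up table of imported data; zip(*rows) is the idiomatic Python equivalent of the TRANSPOSE() function.

```python
# Transposing swaps rows and columns. The listing data here is invented
# purely to illustrate the shape change.
rows = [
    ["Listing", "Price"],
    ["2015 Toyota Camry", "$8,995"],
    ["2012 Ford F-150", "$12,500"],
]

# zip(*rows) pairs up the nth element of every row, turning rows into columns.
transposed = [list(col) for col in zip(*rows)]
print(transposed)
# [['Listing', '2015 Toyota Camry', '2012 Ford F-150'],
#  ['Price', '$8,995', '$12,500']]
```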
For example, if you were importing a table using ImportHTML, your formula would look like this (A2 contains the source URL):

=TRANSPOSE(IMPORTHTML(A2, "table", 1))
Parse Data using Split Text to Columns
If the data point you’re after is buried in the cell with other, undesirable data, there is a way to quickly sift for that nugget of data gold.
Highlight the raw cells of interest and navigate to Data > Split text to columns; you will be presented with the option to select where you want the split to occur.
Many times, the data will already be structured in some way, with commas, periods, semicolons, or another character that separates the data and is the perfect place for the split to occur. If that isn't the case (as in our Craigslist example above), you can use a custom separator.
We noticed that after every price (our desired data) the word “favorite” consistently appeared. We used this word as our custom separator, which worked perfectly to isolate the price of vehicles being sold on Craigslist.
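The same split-and-clean step can be sketched in Python. The cell text below is invented for illustration; the idea is that splitting on the word "favorite" isolates the price fragment, which can then be stripped down to a number for averaging.

```python
# Sketch of the custom-separator idea: split each raw cell on "favorite",
# keep the fragment before it, and clean it into a number.
raw_cells = [
    "$8,995 favorite this post 2015 Toyota Camry",   # made-up sample text
    "$12,500 favorite this post 2012 Ford F-150",
]

prices = []
for cell in raw_cells:
    fragment = cell.split("favorite")[0]              # "$8,995 "
    number = fragment.strip().lstrip("$").replace(",", "")
    prices.append(int(number))

print(prices)                     # [8995, 12500]
print(sum(prices) / len(prices))  # 10747.5
```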
Go Forth and Collect Your Own Data
Data is like bacteria. It’s all around us, yet most of the time we don’t even notice it.
Okay, that’s a bad-teria joke. I’m done.
Now, the next time you brainstorm a content idea but are unsure whether you'll be able to access the data, try using the ImportXML function in Google Sheets to easily find, collect, and organize original data.