Use Power BI to Collect Unstructured Data from the Web

Follow the steps in this article to learn how to scrape data from websites, structure it in tables, and use it in a Power BI report. In this example, I show how to create a Power BI calendar of events that draws data from two publicly accessible websites.

This is one of my favorite ways to use Power BI and I’m always surprised at how few people are aware of this amazing capability. You can see an example of this at my other site: https://www.d365calendar.com/about/powerbi-calendar/

First, set up Power BI Desktop

Download Power BI Desktop (not to be confused with Power BI Report Builder, which is the new name for the tool used to build SQL Server Reporting Services Reports and creates RDL files). To download Power BI Desktop, go to powerbi.microsoft.com and click on the Products menu, or open the Windows Store on your machine.

Second, get the data
  1. Click on “Get Data” > Web
  2. In the “From Web” dialog, select “Basic” and enter the URL. I’m pulling data from the D365 User Group’s list of events on their website: https://www.d365ug.com/events/calendar
  3. After connecting, click the button “Add Table Using Examples”
  4. This shows you a preview of the web page where you can begin to select information to place in columns to build a table:
  5. Place your cursor in the first row and begin typing the text that represents an example of the information you want to capture in that column:
  6. You can type a couple of examples if necessary to make sure the Power BI is grabbing the correct info. Tab away and it will fill in the rest of the column.
  7. Rename the column. I like to use a convention that describes the source of the information.
  8. Then tab to create the second column and repeat the step for the next bit of information. Sometimes the webpage will not have the data broken out in a way that makes it easy to parse just the bits that you want. In this example, I just want the date, but I’m getting a long string with the word “When”, a colon and a bunch of empty spaces.
  9. Once you have selected the raw data from the website that you want for your table, click OK.

Tip: You can take a look at the webpage that you are using as a datasource in a browser and use the browser tools to see which elements of the page might be parseable by Power BI. Power BI will look for CSS IDs and selectors to distinguish the elements that make up the page:

Next, Transform your Data
  1. Back on the Navigator screen, click “Transform Data”:
  2. In the Transform window, I am going to use some transformations to get the data I am interested in. For this first column, I will Split the string by the positions of the characters in the string:
  3. When I do that, Power BI looks at the string and automatically shows me where each part of the string starts:

    So if we look at that information, we can see that position 0 must be where the word “When:” starts, and position 104 must be where the date starts. 108 must be the day, 111 must be the year, and so on. I don’t want to break the date out into its sub-parts, so I am going to just split it at positions 0 and 104:
  4. This gives me a new column that just starts at position 104:
    Tip: If I need to undo a step because I mess something up, I can delete it on the right:
  5. Next, I want to just get the date out of my new column. I am going to Split the new column by a delimiter. I don’t want the stuff that starts with the word “from” so I can use that word as a custom delimiter:
  6. Check out what happened here, automatically. After I split that column out, Power BI automatically recognized that this data is a Date type. I can tell by the icon at the top of the new column:
  7. I do a similar “Split by Position” to parse out the location and rename that column. It took some trial and error to find the exact position that the location starts. It turned out that I wanted to split by positions 0 and 117. Then I renamed the column to D365UG_Location:
  8. When I’m satisfied with my table, I click on “Close and Apply”:
Rinse and Repeat

I want to combine this data with another data source, so let’s try another web page. I will repeat the above process with data from https://www.365portal.org/events/.

Now I have two tables with data I can use in my report.

I can continue to add more in the same manner, or even connecting to other types of data. I’ve also gone ahead and connected some data from CDS.

Normalize

Before I proceed, I am actually going to rename the columns so that across the three different data sources, they are uniform for Name, Date and Location. That will let me use another feature of Power BI to create a new table that combines all the data I scraped from the websites and pulled in from CDS.

  1. I want to combine these different data sources into one new table. So in the Power Query editor, I click on “Append > Append as New” and select the three source tables:
  2. This creates a new table for me that has all the columns from all of my data sources. I have removed the columns that I am not interested in for the next step, so all I am left with is Name, Date and Location. I have also renamed this new table to “Combined”:
  3. Click “Close and Apply” again
Design your report
  1. Now I can begin to design what I want this to look like. I want to add a new control to display this data, so I am going to import it from the Power BI marketplace:
    I’ve opted to use the MAQ Calendar since it has a decent layout and navigation to see the calendar by day, week or month.
  2. Lastly, I am going to embed this on a portal using a simple iframe:
  3. On my portal website, d365calendar.com, (which runs on Microsoft’s Power Portals by the way!) I’ve added a page and in the HTML, I paste the iframe code from Power BI:
  4. And here is what it looks like live on the site:

These are some of my favorite tricks with Power BI, and I hope you’ll enjoy using them now as well. Leave a comment below and tell me what you think!

Published by Matt Wittemann

MVP, Microsoft Business Applications Senior Solution Architect Author, speaker and trainer

Leave a Reply

Please log in using one of these methods to post your comment:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

%d bloggers like this: