To better understand how you can take advantage of the data scraping functionality, let's create an automation project that extracts some specific information from Wikipedia and writes it to an Excel spreadsheet. You can use this type of automation in different scenarios, such as extracting lists of products and their prices from e-commerce websites.
It is recommended to run your web automations on Internet Explorer 11 and above, Mozilla Firefox 50 or above, or the latest version of Google Chrome.
Let’s say you want to start reading up on economics and you want to get a list of Wikipedia articles on the subject, together with their URLs, and the additional information that is provided in the search results for each article. You can do the following:
- Open Internet Explorer and navigate to en.wikipedia.org.
- In the Search Wikipedia box, type "economics", then click "containing... economics" in the drop-down that appears. A web page opens displaying the search results.
- In Studio, create a New Blank Process.
- From the Activities panel, add an Open Browser activity to the Designer panel and, in the Url field, paste the URL of the web page with the search results. In our example, the URL is:
- In the Design ribbon tab, in the Wizards group, click Data Scraping. The Extract Wizard is displayed.
- Following the wizard, select the first and last items on the web page. The Configure Columns wizard step is displayed and the fields you selected are highlighted in the web browser.
- Select the Extract URL check box and change the name of the column headers to something relevant, for example "Article Title" and "URL".
- Click Next. A preview of the data is displayed in the Preview Data wizard step. Note that because the Wikipedia page uses relative URLs, the URL column contains relative URLs as well. You can correct this in the Excel output after the project is executed by adding the string "https://en.wikipedia.org" at the beginning of each cell in the URL column.
- Click the Extract Correlated Data button to extract additional information about the articles. The Extract Wizard starts again.
- Following the wizard again, indicate the information about the size and date of the last edit that is available for each article. The Configure Columns step is reached again.
- Change the name of the new column header to "Additional Information" and click Next. The data is displayed in the Preview Data wizard step. Optionally, you can change the order of the columns by dragging them in place.
- In the Maximum number of results type 60. The Wikipedia search lists 20 results per page and, for our example, we want to extract the first three pages of search results.
- Click Finish. The Indicate Next Link window is displayed prompting you to indicate the Next button or arrow to click if the data spans more than one page.
- Click Yes and select the next 20 button below the search results in Wikipedia. The project is updated and a Data Scraping sequence is displayed in the Designer panel. A DataTable variable,
ExtractDataTablehas been automatically generated.
- In the Variables panel, change the scope of the automatically generated
ExtractDataTablevariable to Sequence. Do this to make the variable available outside of its current scope, the Data Scraping sequence.
- Add an Excel Application Scope activity under the Data Scraping sequence.
- In the Properties panel of the Excel Application Scope activity, in the WorkbookPath field, type
"web_scraping.xlsx". Upon project execution, a file with this name is created in the project folder to store data from the scraping. Alternatively, you can specify a file that already exists on your machine.
- In the Do sequence of the Excel Application Scope activity, add a Write Range activity and in the Properties panel:
- In the DataTable field, add the
- Select the AddHeaders check box to include the column names in the output.
The final project should look as in the following screenshot:
- Press F5 to execute the project.
- Open the Excel file you defined in step 17. Note that all columns are populated correctly.
Updated about a year ago