Utrecht University aimed to collect website text data from start-up companies and analyze it to identify which of these companies are developing sustainable technologies and business models. The webpages of 80,000 start-up companies needed to be collected. Assessing the level of sustainability by manually looking up company websites would have been too time-consuming and error-prone. Therefore, a web scraping solution was built to automatically gather text data from the list of companies. The research engineering team of Utrecht University had experience, from previous projects, in building a web scraper on the public cloud. However, that setup consisted of many servers, which were difficult to maintain and monitor. In addition, transferring the data to local infrastructure was time-consuming.
Run applications without managing servers
The cloud solution architects of SURF redesigned the existing web scraping solution to solve the aforementioned issues. The new solution consists of serverless components. Serverless computing is one of the advantages of public clouds: it allows developers to build and run applications without having to manage any servers, which improves maintainability. It is also highly scalable and comes with a pay-as-you-go model. Depending on the use case, it can be a good fit (when the workload is volatile or cyclical) or a bad fit (when the workload is consistent over time). Since the scraping happens infrequently and the number of domains varies, a serverless solution was a good fit.
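The trade-off between a pay-as-you-go model and always-on servers can be illustrated with a small cost comparison. The following Python sketch is only an illustration: the function names are hypothetical, and every price and volume in it is a made-up placeholder, not a figure from this project or from any cloud provider's price list.

```python
# Illustrative cost model only. All rates and volumes are hypothetical
# placeholders, not real cloud tariffs and not figures from this project.

HOURS_PER_MONTH = 730  # common approximation of the hours in one month


def always_on_cost(hourly_rate: float, servers: int = 1) -> float:
    """Monthly cost of servers that are billed every hour, busy or idle."""
    return hourly_rate * servers * HOURS_PER_MONTH


def pay_per_use_cost(invocations: int, price_per_invocation: float,
                     free_invocations: int = 0) -> float:
    """Monthly cost when only invocations beyond the free tier are billed."""
    billable = max(0, invocations - free_invocations)
    return billable * price_per_invocation


def break_even_invocations(hourly_rate: float, price_per_invocation: float,
                           servers: int = 1, free_invocations: int = 0) -> float:
    """Monthly invocation count at which both models cost the same."""
    return always_on_cost(hourly_rate, servers) / price_per_invocation + free_invocations


if __name__ == "__main__":
    rate, price = 0.10, 0.00001  # hypothetical: per server-hour, per invocation
    print("always-on server:", always_on_cost(rate))
    print("infrequent month:", pay_per_use_cost(1_000_000, price))
    print("steady heavy month:", pay_per_use_cost(20_000_000, price))
    print("break-even invocations:", break_even_invocations(rate, price))
```

Under these assumed numbers, an infrequent workload stays well below the cost of a permanently running server, while a consistently heavy workload exceeds it. That is the pattern described above; the actual break-even point depends entirely on real prices and workload.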
Little setup time, a large free tier, and avoiding common pitfalls
The implementation of the new architecture was done by SURF in close collaboration with the Utrecht University engineers and the researcher (PhD candidate J. Leendertse). During the implementation, several lessons were learned. First, the integration of the services required relatively little effort to set up. Most time was spent on fine-tuning the configuration and finding the most suitable services for this use case. A misconfiguration would immediately lead to a slower or more expensive solution. Close monitoring of the logs, metrics, and costs is vital in a public cloud environment to identify any misconfiguration and avoid unwanted costs or security risks. Second, these serverless services come with a relatively large free tier, which is renewed every month. This covered a large part of the implementation, testing, and even part of the final scraping. Finally, experience in building cloud solutions helps to avoid common pitfalls, identify misconfigurations early, and utilize the public cloud cost-effectively and securely.
22 million pages to analyze
With the final solution, all the website content was successfully collected, preprocessed, and transferred, ready to be analyzed. In total, the 80,000 domains resulted in more than 22 million web pages being scraped. The researcher analyzed 50 GB of text data and identified which start-ups are environmentally sustainable and in which European regions they are located. The next step is to identify which regional factors determine the founding of sustainable start-ups. These insights can be adopted by regional governments to create a supportive ecosystem for sustainable entrepreneurship. The research paper has been submitted for publication and is pending review.
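To put these totals in perspective, the averages they imply can be worked out in a few lines of Python. This is a rough back-of-the-envelope sketch, not a result from the study: it assumes that the 50 GB of text covers all scraped pages, that a gigabyte is 10^9 bytes, and it treats "more than 22 million" as exactly 22 million. The function name is hypothetical.

```python
# Back-of-the-envelope averages derived from the totals in the text.
# Assumptions: 1 GB = 10**9 bytes, the 50 GB of text covers all scraped
# pages, and "more than 22 million" is rounded down to 22 million.

def scrape_averages(domains: int, pages: int, text_bytes: int) -> tuple[float, float]:
    """Return (average pages per domain, average text bytes per page)."""
    return pages / domains, text_bytes / pages


if __name__ == "__main__":
    pages_per_domain, bytes_per_page = scrape_averages(
        domains=80_000,
        pages=22_000_000,
        text_bytes=50 * 10**9,
    )
    print(f"pages per domain (at least): {pages_per_domain:.0f}")
    print(f"text per page (at most): {bytes_per_page / 1000:.1f} kB")
```

Because the page count is a lower bound, the pages-per-domain figure is a minimum and the text-per-page figure is a maximum under these assumptions.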
SURF Public Cloud Call to learn how to effectively utilize cloud resources for research
This project was part of the public cloud for research call by SURF. In this call, SURF's cloud solution architects help develop a Proof of Concept for research projects in the public cloud. The goal of this call is twofold: first, to provide public cloud services to researchers to accelerate research and, second, to gain and share expertise on how to effectively utilize cloud resources for research. Are you working on a data-intensive research project that could benefit from public cloud services, and would you like to know more about the public cloud call? Please contact us for more information: https://www.surf.nl/en/call-public-cloud-for-research
This article has been written by Robert Jan Bood, Machiel Jansen, and Mariël Oolthuis.
More information on the research and researcher can be found here: https://www.uu.nl/staff/JLeendertse/Profile
For more technical information on the scrape pipeline, see GitHub: https://github.com/UtrechtUniversity/ia-webscraping