6 Ways To Rapidly Collect Massive Datasets in your Apps

Muhammad Azizul Hakim

4 years ago

Do you have trouble collecting massive datasets in your apps? In this article, you’ll learn what web scraping is, how it is done, and how to run Python web scraping libraries from a lightweight Windows Python GUI built with Python4Delphi.


What is web scraping?

Web Scraping is a technique where a computer program extracts data from human-readable output coming from websites. The web scraping software may directly access the World Wide Web using the Hypertext Transfer Protocol or a web browser.

While Web Scraping can be done manually by a software user, the term typically refers to automated processes implemented using a program, bot, or web crawler. It is a form of copying in which specific data is gathered and copied from the web, typically into a central local database, spreadsheet, API, or any format that is more useful for the user, for later retrieval or analysis.

First, the app needs to interpret a web page as data

Web pages are built using text-based markup languages such as HTML and XHTML, and frequently contain rich and useful data in text form. However, most web pages are designed for human end users, not for automated use. As a result, building specialized tools and software that can scrape any web page is a challenging task.
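
As a small illustration of turning markup into data, the sketch below uses only Python's standard-library html.parser module (not one of the libraries covered later) to pull every link out of a page. The LinkCollector class and the sample markup are made up for this example.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect (text, href) pairs for every <a href=...> element."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None  # href of the <a> currently open, if any
        self._text = []    # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        # Only keep text that appears inside an open link.
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append(("".join(self._text).strip(), self._href))
            self._href = None


# Hypothetical page fragment, standing in for a downloaded web page.
SAMPLE_HTML = """
<html><body>
  <h1>Downloads</h1>
  <p>Get <a href="/delphi">Delphi</a> or <a href="/python">Python <b>3</b></a>.</p>
</body></html>
"""

collector = LinkCollector()
collector.feed(SAMPLE_HTML)
for text, href in collector.links:
    print(text, "->", href)
```

Running it prints one line per link: the visible text followed by the target URL.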

Delphi plus Python is a powerful combination for web scraping

In this tutorial, we’ll build Windows Apps with extensive Web Scraping capabilities by integrating Python’s Web Scraping libraries with Embarcadero’s Delphi, using Python4Delphi (P4D).

P4D empowers Python users with Delphi’s award-winning VCL functionalities for Windows, which enable us to build native Windows apps 5x faster. This integration lets us create a modern GUI with the Windows 10 look and responsive controls for our Python Web Scraping applications. Python4Delphi also comes with an extensive range of demos, use cases, and tutorials.

We’re going to cover the following…

How to use Requests, BeautifulSoup, Instaloader, Snscrape, Tweepy, and Feedparser Python libraries to perform Web Scraping tasks

All of them will be integrated with Python4Delphi to create Windows Apps with Web Scraping capabilities.

Prerequisites

Before we begin to work, download and install the latest Python for your platform. Follow the Python4Delphi installation instructions mentioned here. Alternatively, you can check out the easy instructions found in the Getting Started With Python4Delphi video by Jim McKeeth.

Time to get started!

First, open and run our Python GUI using the Demo01 project from Python4Delphi with RAD Studio. Then insert the script into the lower Memo, click the Execute button, and get the result in the upper Memo. You can find the Demo01 source on GitHub. The behind-the-scenes details of how Delphi manages to run your Python code in this amazing Python GUI can be found at this link.

How do I Scrape Website Data using Python Requests?

“Requests” is a simple, yet elegant HTTP library. Requests allows you to execute standard HTTP requests extremely easily. Using this library, you can pass parameters to requests, add headers, receive and process responses, and execute authenticated requests.

Requests is ready for the demands of building robust and reliable HTTP-speaking applications. Its features include:

Keep-Alive & Connection Pooling
International Domains and URLs
Sessions with Cookie Persistence
Browser-style TLS/SSL Verification
Basic & Digest Authentication
Familiar dict–like Cookies
Automatic Content Decompression and Decoding
Multi-part File Uploads
SOCKS Proxy Support
Connection Timeouts
Streaming Downloads
Automatic honoring of .netrc
Chunked HTTP Requests

After installing Python4Delphi properly, you can install Requests with pip (or easy_install) from your command prompt:

pip install requests

Don’t forget to add the path where Requests is installed to your System Environment Variables.

Example System Environment Variables


The following is a code example of Requests to get content, status, and list of response headers (run this inside the lower Memo of Python4Delphi Demo01 GUI):

Example Python Requests
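
A minimal sketch of such a script, not the article's original listing, might look like the following. It assumes Requests is installed; the User-Agent string, the 10-second timeout, and the function names are illustrative choices rather than anything the library requires.

```python
def summarize_response(response, preview_chars=200):
    """Return the status code, headers, and a short preview of the body."""
    return {
        "status": response.status_code,
        "headers": dict(response.headers),
        "preview": response.text[:preview_chars],
    }


def fetch_summary(url, timeout=10):
    """Download `url` with Requests and summarize the response."""
    # Third-party (pip install requests); imported here so the helper
    # above can be used on its own.
    import requests

    response = requests.get(
        url,
        headers={"User-Agent": "p4d-scraping-demo/0.1"},  # hypothetical UA string
        timeout=timeout,  # fail instead of hanging on a slow server
    )
    response.raise_for_status()  # raise on 4xx/5xx instead of parsing an error page
    return summarize_response(response)
```

In the lower Memo you could then call, for example, print(fetch_summary("https://example.com")), where the URL is only a placeholder.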


Here is the result in Python GUI

Requests Demo with Python4Delphi in Windows

At the time of writing, Requests is one of the most downloaded Python packages, with around 14M downloads per week, and according to GitHub more than 500,000 repositories depend on it. These numbers suggest it is a well-established and widely trusted library.

How do I Scrape Websites using Python BeautifulSoup?

BeautifulSoup is a library that makes it easy to scrape information from web pages. It sits on top of an HTML or XML parser, providing Pythonic idioms for iterating, searching, and modifying the parse tree. Since 2004, BeautifulSoup has been saving programmers hours or days of work on quick-turnaround screen scraping projects.

BeautifulSoup is a Python library designed for quick turnaround projects like screen-scraping. Three features make it powerful:

BeautifulSoup provides a few simple methods and Pythonic idioms for navigating, searching, and modifying a parse tree: a toolkit for dissecting a document and extracting what you need. It doesn’t take much code to write an application.
BeautifulSoup automatically converts incoming documents to Unicode and outgoing documents to UTF-8. You don’t have to think about encodings unless the document doesn’t specify an encoding and Beautiful Soup can’t detect one. Then you just have to specify the original encoding.
BeautifulSoup sits on top of popular Python parsers like lxml and html5lib, allowing you to try out different parsing strategies or trade speed for flexibility.

Are you looking for tools to build website scrapers to automate your data collecting process, and build a nice Windows GUI for them? This section will show you how to get started!

Here is how you can get BeautifulSoup:

pip install beautifulsoup4

Example of using Python BeautifulSoup to collect and gather weather data

The following is an example of BeautifulSoup for scraping the Austin/San Antonio, TX weather data from the National Weather Service (run this inside the lower Memo of Python4Delphi Demo01 GUI):
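
A minimal sketch of such a scraper, not the article's original listing, is shown below. It assumes BeautifulSoup is installed (pip install beautifulsoup4). The class names tombstone-container, period-name, short-desc, and temp are assumptions about how the forecast page is laid out, and the sample markup is made up, so check the live page before relying on them.

```python
def parse_forecast(html):
    """Extract (period, description, temperature) rows from forecast markup."""
    # Third-party; imported here so the module loads without it.
    from bs4 import BeautifulSoup

    soup = BeautifulSoup(html, "html.parser")
    rows = []
    # Assumed layout: one div.tombstone-container per forecast period.
    for item in soup.select("div.tombstone-container"):
        period = item.select_one(".period-name")
        desc = item.select_one(".short-desc")
        temp = item.select_one(".temp")
        if period is None or desc is None or temp is None:
            continue  # skip blocks that do not match the assumed layout
        rows.append((
            period.get_text(" ", strip=True),
            desc.get_text(" ", strip=True),
            temp.get_text(" ", strip=True),
        ))
    return rows


# Hypothetical fragment imitating the assumed page structure.
SAMPLE_HTML = """
<div id="seven-day-forecast">
  <div class="tombstone-container">
    <p class="period-name">Tonight</p>
    <p class="short-desc">Mostly Clear</p>
    <p class="temp temp-low">Low: 49 F</p>
  </div>
  <div class="tombstone-container">
    <p class="period-name">Thursday</p>
    <p class="short-desc">Sunny</p>
    <p class="temp temp-high">High: 63 F</p>
  </div>
</div>
"""
```

Calling parse_forecast(SAMPLE_HTML) returns one tuple per forecast period. For a live page, you would first download the HTML (for example with Requests) and pass the response text to the same function.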


Here is the BeautifulSoup result in the Python GUI:

BeautifulSoup Demo with Python4Delphi in Windows

How do I Scrape Instagram Data using Python Instaloader?

“Instaloader” is a tool to download Instagram pictures (or videos) and retrieve their captions and other metadata.

The following are Instaloader features and functionalities:

Downloads public and private profiles, hashtags, user stories, feeds, and saved media.
Downloads comments, geotags, and captions of each post.
Automatically detects profile name changes and renames the target directory accordingly.
Allows fine-grained customization of filters and where to store downloaded media.
Automatically resumes previously interrupted download iterations.

Are you looking for tools to build Instagram scrapers to automate your data collecting or information retrieval process, and build a nice GUI for them? This section will show you how to get started!

This section will guide you through combining Python4Delphi with the Instaloader library, inside Delphi and C++Builder, from installing Instaloader with pip to downloading all @embarcaderotech Instagram content!

Here is how you can get Instaloader:

pip install instaloader

Python’s Instaloader allows us to get any posts from any public profile easily. We just need to use the get_posts() method. We will use this method on the profile of @embarcaderotech.

How to download Instagram posts using Delphi and Python

Let’s download each post image/video and caption, by looping over the generator object using the .download_post() method. Run the following script in Python4Delphi GUI:
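
A minimal sketch of such a script, not the article's original listing, could look like this. It assumes Instaloader is installed; the limit parameter and the helper names are additions for illustration, and Instagram may throttle or require a login for some requests.

```python
from itertools import islice


def folder_name(index, prefix="embarcadero"):
    """Return the per-post target folder, e.g. 'embarcadero_1'."""
    return f"{prefix}_{index}"


def download_recent_posts(username, limit=5):
    """Download up to `limit` posts of a public profile, one folder per post."""
    # Third-party (pip install instaloader); imported here so the helper
    # above can be used on its own.
    import instaloader

    loader = instaloader.Instaloader()
    profile = instaloader.Profile.from_username(loader.context, username)
    # get_posts() is a lazy iterator, so islice stops after `limit` posts.
    for index, post in enumerate(islice(profile.get_posts(), limit), start=1):
        loader.download_post(post, target=folder_name(index))
```

Calling download_recent_posts("embarcaderotech") in the lower Memo would then fetch the most recent posts; raise limit to download more of the profile.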


The script saves each post in its own new folder, named “embarcadero_1” through “embarcadero_n”, inside the directory that contains the Python4Delphi Demo01.exe we use to run all the scripts above.

In each folder, you will see the actual content of the profile’s posts, such as videos or images. The scripts above are taken and modified from this post.

Instaloader Python4Delphi results

There are a lot of results so it’s not very screenshot-friendly!

Instaloader Demo with Python4Delphi in Windows

Where do the screen-scraping results go?

All the content is saved automatically to the directory where you save or run the Python4Delphi Demo01 GUI:

You will get all the content you need in no time (compared with the hard work of downloading it manually)!

Here is what the contents of each folder look like:

How do I Retrieve Twitter Data using Python Snscrape?

“Snscrape” is a library that allows anyone to scrape social networking services (SNS) without requiring personal API keys. It can return thousands of user profiles, hashtags, contents, or searches in seconds and has powerful and highly customizable tools.

The following services are currently supported:

Facebook: User profiles, groups, and communities (aka visitor posts)
Instagram: User profiles, hashtags, and locations
Reddit: Users, subreddits, and searches (via Pushshift)
Telegram: Channels. Learn why you should use Telegram Messenger in your own applications for added security in this article.
Twitter: Users, user profiles, hashtags, searches, threads, and list posts
VKontakte: User profiles
Weibo (Sina Weibo): User profiles

In this tutorial, we will only focus on using Python Snscrape for Twitter.

How do I get the Python Snscrape library?

pip install snscrape

Run the following command in cmd to get all tweets by Embarcadero Technologies (@EmbarcaderoTech):

snscrape twitter-user EmbarcaderoTech > twitter-@EmbarcaderoTech.txt

The scraping results are stored in the twitter-@EmbarcaderoTech.txt file:

How do I Retrieve Twitter Data using Python Tweepy?

“Tweepy” is an easy-to-use Python library for accessing the Twitter API.

There are limitations in using Tweepy for scraping tweets. The standard API only lets you retrieve tweets from the last 7 days and limits you to 18,000 tweets per 15-minute window. However, combining Tweepy with Snscrape lets you work around these limits and retrieve all the tweets you want, as long as their URLs have already been scraped and stored in .txt files, as shown in the previous section!
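
To connect the two tools, the tweet URLs saved by Snscrape first have to be turned into tweet IDs that the API can look up. The sketch below does this with the standard library only. It assumes each line of the file holds one tweet URL ending in /status/ followed by the numeric ID, and the sample IDs are made up.

```python
import re

# Matches the numeric ID in a tweet URL such as
# https://twitter.com/EmbarcaderoTech/status/1400000000000000001
STATUS_ID = re.compile(r"/status/(\d+)")


def tweet_ids_from_urls(lines):
    """Return the tweet IDs found in `lines`, in order and without duplicates."""
    seen = set()
    ids = []
    for line in lines:
        match = STATUS_ID.search(line)
        if match is None:
            continue  # blank line or something that is not a tweet URL
        tweet_id = int(match.group(1))
        if tweet_id not in seen:
            seen.add(tweet_id)
            ids.append(tweet_id)
    return ids


# Hypothetical lines, shaped like the assumed Snscrape output.
SAMPLE_LINES = [
    "https://twitter.com/EmbarcaderoTech/status/1400000000000000001\n",
    "https://twitter.com/EmbarcaderoTech/status/1400000000000000002\n",
    "https://twitter.com/EmbarcaderoTech/status/1400000000000000001\n",  # duplicate
]

print(tweet_ids_from_urls(SAMPLE_LINES))
```

With the real file, you would pass the open file object instead of the sample list, for example: with open("twitter-@EmbarcaderoTech.txt") as f: ids = tweet_ids_from_urls(f).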

Getting started with Python Tweepy

To get started with Tweepy you’ll need to do the following things:

Set up a Twitter account if you don’t have one already.
Using your Twitter account, apply for Developer Access and then create an application that will generate the API credentials you will use to access Twitter from Python.
Install and import the Tweepy package.

Once you’ve done these things, you are ready to begin querying Twitter’s API to see what you can learn about tweets!

Run this pip command to install Tweepy:

pip install tweepy

Example Python code showing how to retrieve Twitter tweets

The following code uses Tweepy to retrieve all the @EmbarcaderoTech tweets listed in the previous section on Snscrape (run this inside the lower Memo of Python4Delphi Demo01 GUI):
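
A minimal sketch of this step, not the article's original listing, is given below. It assumes Tweepy 4.x and the Twitter v1.1 endpoints, which may not be available at every API access level. The function names are illustrative, and the four credential arguments stand for the keys generated by your own developer application.

```python
import csv


def status_rows(statuses):
    """Turn Tweepy status objects into plain (id, created_at, text) rows."""
    return [(s.id, str(s.created_at), s.full_text) for s in statuses]


def save_rows(rows, path):
    """Write the rows to a CSV file with a header line."""
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(["id", "created_at", "text"])
        writer.writerows(rows)


def fetch_statuses(tweet_ids, consumer_key, consumer_secret,
                   access_token, access_secret):
    """Yield one full tweet object per ID, skipping tweets that cannot be read."""
    # Third-party (pip install tweepy); imported here so the helpers
    # above can be used on their own.
    import tweepy

    auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
    auth.set_access_token(access_token, access_secret)
    api = tweepy.API(auth, wait_on_rate_limit=True)  # sleep when rate limited
    for tweet_id in tweet_ids:
        try:
            # tweet_mode="extended" returns the untruncated text as full_text.
            status = api.get_status(tweet_id, tweet_mode="extended")
        except tweepy.TweepyException:  # named TweepError in Tweepy 3.x
            continue  # deleted or protected tweet
        yield status
```

With a list of tweet IDs taken from the Snscrape output, save_rows(status_rows(fetch_statuses(ids, ...)), "embarcaderoTech_Tweets.csv") would produce the spreadsheet-ready file.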


Using Python and Tweepy for powerful Twitter scraping

Tweepy Demo with Python4Delphi in Windows

Tweepy Twitter scraping results in an Excel spreadsheet

We successfully scraped all @EmbarcaderoTech tweets, from 2009 to the present, and stored them in the “embarcaderoTech_Tweets.csv” file.

How do I Pull RSS Feed Data using Feedparser?

“Feedparser” or Universal Feed Parser is a library to parse Atom and RSS feeds in Python. feedparser can handle RSS 0.90, Netscape RSS 0.91, Userland RSS 0.91, RSS 0.92, RSS 0.93, RSS 0.94, RSS 1.0, RSS 2.0, Atom 0.3, Atom 1.0, and CDF feeds. It also parses several popular extension modules, including Dublin Core and Apple’s iTunes extensions.

To use feedparser, you will need Python 3.6 or later. feedparser is not meant to run standalone; it is a module for you to use as part of a larger Python program.

feedparser is easy to use; the module is self-contained in a single file, feedparser.py, and it has only one primary public function, parse. parse takes a number of arguments, but only one is required, and it can be a URL, a local filename, or a raw string containing feed data in any format.

Here is how you can get feedparser:

pip install feedparser

Run the following script to parse data from the Stack Overflow RSS feed:
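
A minimal sketch of such a script, not the article's original listing, follows. It assumes feedparser is installed. The feed_entries name, the limit parameter, and the sample feed are made up for illustration.

```python
def feed_entries(source, limit=5):
    """Parse a feed URL, file path, or raw XML string into (title, link) pairs."""
    # Third-party (pip install feedparser); imported here so the module
    # loads without it.
    import feedparser

    parsed = feedparser.parse(source)
    return [
        (entry.get("title", ""), entry.get("link", ""))
        for entry in parsed.entries[:limit]
    ]


# Hypothetical RSS 2.0 document, useful for trying the function offline.
SAMPLE_RSS = """<?xml version="1.0"?>
<rss version="2.0"><channel>
  <title>Example feed</title>
  <item><title>First question</title><link>https://example.com/q/1</link></item>
  <item><title>Second question</title><link>https://example.com/q/2</link></item>
</channel></rss>"""
```

Because parse accepts a raw string as well as a URL, feed_entries(SAMPLE_RSS) works without a network connection. For live data you would pass the feed's address instead, for example https://stackoverflow.com/feeds (assumed here to be the Stack Overflow feed URL).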


Feedparser results in the Python4Delphi GUI

It’s a lot of data, again!

Feedparser Demo with Python4Delphi in Windows

Want to know some more? Then check out Python4Delphi which easily allows you to build Python GUIs for Windows using Delphi.