6 Ways To Rapidly Collect Massive Datasets in your Apps

Do you have trouble collecting massive datasets in your apps? In this article, you’ll learn what web scraping is, how it is done, and how to build lightweight Windows tools for web scraping with Python and Python4Delphi, including how to view the scraping results in a Python4Delphi GUI.

What is web scraping?

Web Scraping is a technique where a computer program extracts data from human-readable output coming from websites. The web scraping software may directly access the World Wide Web using the Hypertext Transfer Protocol or a web browser.

While Web Scraping can be done manually by a software user, the term typically refers to automated processes implemented using a program, bot, or web crawler. It is a form of copying in which specific data is gathered and copied from the web, typically into a central local database, spreadsheet, API, or any format that is more useful for the user, for later retrieval or analysis.

First, the app needs to interpret a web page as data

Web pages are built using text-based markup languages such as HTML and XHTML, and frequently contain rich and useful data in text form. However, most web pages are designed for human end users rather than for automated use, which makes it challenging to build specialized tools and software that can scrape any web page.
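To make the idea concrete, here is a minimal sketch of treating a page as data, using only Python's standard library. The HTML snippet and the LinkCollector class name are made up for illustration; a real scraper would first download the page over HTTP.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects (text, href) pairs for every <a> tag that has an href."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        # Only keep text that appears inside a link
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append(("".join(self._text).strip(), self._href))
            self._href = None


# Hypothetical page content, hard-coded so the example runs offline
SAMPLE_HTML = """
<html><body>
  <h1>Downloads</h1>
  <a href="/delphi">Delphi</a>
  <a href="/cbuilder">C++Builder</a>
</body></html>
"""

collector = LinkCollector()
collector.feed(SAMPLE_HTML)
for text, href in collector.links:
    print(text, "->", href)
```

The libraries covered later in this article hide most of this bookkeeping, but the principle is the same: markup goes in, structured records come out.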

Delphi plus Python is a powerful combination for web scraping

In this tutorial, we’ll build Windows Apps with extensive Web Scraping capabilities by integrating Python’s Web Scraping libraries with Embarcadero’s Delphi, using Python4Delphi (P4D).

P4D empowers Python users with Delphi’s award-winning VCL functionalities for Windows which enables us to build native Windows apps 5x faster. This integration enables us to create a modern GUI with Windows 10 looks and responsive controls for our Python Web Scraping applications. Python4Delphi also comes with an extensive range of demos, use cases, and tutorials.

We’re going to cover the following…

How to use Requests, BeautifulSoup, Instaloader, Snscrape, Tweepy, and Feedparser Python libraries to perform Web Scraping tasks

All of them would be integrated with Python4Delphi to create Windows Apps with Web Scraping capabilities.

Prerequisites

Before we begin to work, download and install the latest Python for your platform. Follow the Python4Delphi installation instructions mentioned here. Alternatively, you can check out the easy instructions found in the Getting Started With Python4Delphi video by Jim McKeeth.

Time to get started!

First, open and run our Python GUI using project Demo1 from Python4Delphi with RAD Studio. Then insert the script into the lower Memo, click the Execute button, and get the result in the upper Memo. You can find the Demo1 source on GitHub. The behind-the-scenes details of how Delphi manages to run your Python code in this Python GUI can be found at this link.

How do I Scrape Website’s Data using Python Requests?

“Requests” is a simple, yet elegant HTTP library. Requests allows you to send standard HTTP requests extremely easily. Using this library, you can pass parameters to requests, add headers, receive and process responses, and execute authenticated requests.

Requests is ready for the demands of building robust and reliable HTTP-speaking applications. Its features include:

Keep-Alive & Connection Pooling
International Domains and URLs
Sessions with Cookie Persistence
Browser-style TLS/SSL Verification
Basic & Digest Authentication
Familiar dict-like Cookies
Automatic Content Decompression and Decoding
Multi-part File Uploads
SOCKS Proxy Support
Connection Timeouts
Streaming Downloads
Automatic honoring of .netrc
Chunked HTTP Requests
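As a rough illustration of passing parameters and adding headers, the sketch below builds a request without sending it, so it runs offline. The URL, the query parameters, the User-Agent string, and the fetch helper are all placeholders chosen for this example.

```python
import requests

# Build a request without sending it, so the example runs offline.
# The URL, query parameters, and User-Agent string are placeholders.
req = requests.Request(
    "GET",
    "https://example.com/search",
    params={"q": "python4delphi", "page": 2},
    headers={"User-Agent": "my-scraper/0.1"},
)
prepared = req.prepare()

print(prepared.method)
print(prepared.url)
print(prepared.headers["User-Agent"])


def fetch(url, **kwargs):
    """Send a GET request with a timeout and raise on HTTP error codes."""
    with requests.Session() as session:
        response = session.get(url, timeout=10, **kwargs)
        response.raise_for_status()
        return response
```

Inspecting the prepared request is a convenient way to check the final URL and headers before a scraper goes anywhere near the network. The hypothetical fetch helper shows one way to combine a session with a connection timeout.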

After installing Python4Delphi properly, you can get Requests by running pip in your command prompt:

pip install requests

Also, don’t forget to add the paths where Python and Requests are installed to your system environment variables.

Example System Environment Variables

C:/Users/YOUR_USERNAME/AppData/Local/Programs/Python/Python38/Lib/site-packages
C:/Users/YOUR_USERNAME/AppData/Local/Programs/Python/Python38/Scripts
C:/Users/YOUR_USERNAME/AppData/Local/Programs/Python/Python38
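If you are unsure which directories apply to your installation, a short script along these lines can report them. It uses only the standard library; note that when Python is embedded in another program, sys.executable may point at the host application rather than python.exe, so treat the output as a hint.

```python
import sys
import sysconfig

# Installation directories known to the running interpreter
paths = sysconfig.get_paths()

print("Interpreter:   ", sys.executable)
print("Python version:", sys.version.split()[0])
print("site-packages: ", paths["purelib"])
print("Scripts:       ", paths["scripts"])
```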

The following is a code example of Requests to get content, status, and list of response headers (run this inside the lower Memo of Python4Delphi Demo01 GUI):

Example Python Requests

import requests

r = requests.get('https://example.com')

print(r.text)
print(r.headers)
print(r.status_code)

Here is the result in Python GUI

Requests Demo with Python4Delphi in Windows

Requests is one of the most downloaded Python packages today, pulling in around 14M downloads per week. According to GitHub, Requests is currently depended upon by 500,000+ repositories. Given these figures, you can be confident that this is a widely trusted library.

How do I Scrape Websites using Python BeautifulSoup?

BeautifulSoup is a library that makes it easy to scrape information from web pages. It sits on top of an HTML or XML parser, providing Pythonic idioms for iterating, searching, and modifying the parse tree. Since 2004, BeautifulSoup has been saving programmers hours or days of work on quick-turnaround screen-scraping projects.

Three features make it powerful:

BeautifulSoup provides a few simple methods and Pythonic idioms for navigating, searching, and modifying a parse tree: a toolkit for dissecting a document and extracting what you need. It doesn’t take much code to write an application.
BeautifulSoup automatically converts incoming documents to Unicode and outgoing documents to UTF-8. You don’t have to think about encodings unless the document doesn’t specify an encoding and Beautiful Soup can’t detect one. Then you just have to specify the original encoding.
BeautifulSoup sits on top of popular Python parsers like lxml and html5lib, allowing you to try out different parsing strategies or trade speed for flexibility.
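The navigating and searching idioms can be tried on a small inline document before pointing them at a live site. The HTML fragment, its id, and its class names below are invented for this sketch.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML fragment, hard-coded so the example runs offline
html = """
<ul id="products">
  <li class="item"><span class="name">Delphi</span><span class="price">10</span></li>
  <li class="item"><span class="name">C++Builder</span><span class="price">12</span></li>
</ul>
"""

soup = BeautifulSoup(html, "html.parser")

# find() locates a single element; select() takes a CSS selector
product_list = soup.find(id="products")
rows = [
    (item.select_one(".name").get_text(), int(item.select_one(".price").get_text()))
    for item in product_list.select("li.item")
]
print(rows)
```

Swapping "html.parser" for "lxml" or "html5lib" (if installed) is how you would try a different parsing strategy.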

Are you looking for tools to build website scrapers to automate your data collecting process, and build a nice Windows GUI for them? This section will show you how to get started!

Here is how you can get BeautifulSoup

pip install beautifulsoup4

Example of using Python BeautifulSoup to collect and gather weather data

The following is an example of BeautifulSoup for scraping the Austin/San Antonio, TX weather data from the National Weather Service (run this inside the lower Memo of Python4Delphi Demo01 GUI):

from bs4 import BeautifulSoup
import requests
import pandas as pd

# Read url
page = requests.get("https://forecast.weather.gov/MapClick.php?lat=30.2676&lon=-97.743")

# Download the page and start parsing
soup = BeautifulSoup(page.content, 'html.parser')
seven_day = soup.find(id="seven-day-forecast")
forecast_items = seven_day.find_all(class_="tombstone-container")
tonight = forecast_items[0]

# Extract the name of the forecast item, the short description, and the temperature
period = tonight.find(class_="period-name").get_text()
short_desc = tonight.find(class_="short-desc").get_text()
temp = tonight.find(class_="temp").get_text()

# Extract the title attribute from the img tag
img = tonight.find("img")
desc = img['title']

# Extracting all the information from the page
period_tags = seven_day.select(".tombstone-container .period-name")
periods = [pt.get_text() for pt in period_tags]

short_descs = [sd.get_text() for sd in seven_day.select(".tombstone-container .short-desc")]
temps = [t.get_text() for t in seven_day.select(".tombstone-container .temp")]
descs = [d["title"] for d in seven_day.select(".tombstone-container img")]

# Combining our data into a Pandas Dataframe
weather = pd.DataFrame({
    "period": periods,
    "short_desc": short_descs,
    "temp": temps,
    "desc":descs
})

# Print the dataframe
print(weather)
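The temp column comes back as text, so a follow-up step is often needed before the values can be used numerically. The sketch below assumes strings shaped like "Low: 47 °F"; that format, the sample values, and the parse_temp helper are assumptions for illustration, so check them against the actual page output.

```python
import re

# Sample values in the shape the "temp" column is assumed to have
sample_temps = ["Low: 47 °F", "High: 68 °F", "Low: -3 °F"]

TEMP_PATTERN = re.compile(r"(High|Low):\s*(-?\d+)")


def parse_temp(text):
    """Return (kind, degrees) from a string such as 'Low: 47 °F', or None."""
    match = TEMP_PATTERN.search(text)
    if match is None:
        return None
    return match.group(1), int(match.group(2))


parsed = [parse_temp(t) for t in sample_temps]
print(parsed)
```

Returning None for unrecognized strings keeps one odd row from stopping the whole run.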

Here is the BeautifulSoup result in the Python GUI:

BeautifulSoup Demo with Python4Delphi in Windows

How do I Scrape Instagram Data using Python Instaloader?

“Instaloader” is a tool to download Instagram pictures (or videos) and retrieve their captions and other metadata.

The following are Instaloader features and functionalities:

Downloads public and private profiles, hashtags, user stories, feeds, and saved media.
Downloads comments, geotags, and captions of each post.
Automatically detects profile name changes and renames the target directory accordingly.
Allows fine-grained customization of filters and where to store downloaded media.
Automatically resumes previously interrupted download iterations.

Are you looking for tools to build Instagram scrapers to automate your data collecting or information retrieval process, and build a nice GUI for them? This section will show you how to get started!

This section will guide you to combine Python4Delphi with the Instaloader library, inside Delphi and C++Builder, from installing Instaloader with pip to downloading all @embarcaderotech Instagram content instantly!

Here is how you can get Instaloader

pip install instaloader

Python’s Instaloader allows us to get any posts from any public profile easily. We just need to use the get_posts() method. We will use this method on the profile of @embarcaderotech.

How to download Instagram posts using Delphi and Python

Let’s download each post image/video and caption, by looping over the generator object using the .download_post() method. Run the following script in Python4Delphi GUI:

# Import the module
import instaloader

# Create an instance of Instaloader class
bot = instaloader.Instaloader()

# Load a profile from an Instagram handle
profile = instaloader.Profile.from_username(bot.context, 'embarcaderotech')

# Get all posts in a generator object
posts = profile.get_posts()

# Iterate and download
for index, post in enumerate(posts, 1):
    bot.download_post(post, target=f"{profile.username}_{index}")

It will save each post in a new folder, named “embarcaderotech_1” through “embarcaderotech_n”, inside the directory that contains the Python4Delphi Demo01.exe that we use to run all the scripts above.
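Downloading an entire profile can take a while, so during testing it may help to cap the number of posts. The download_first helper below is a hypothetical wrapper, not part of Instaloader; it relies only on the get_posts() and download_post() calls used in the script above.

```python
from itertools import islice


def download_first(loader, profile, limit):
    """Download at most `limit` posts and return the target folder names."""
    targets = []
    # islice stops pulling from the generator once `limit` posts are taken
    for index, post in enumerate(islice(profile.get_posts(), limit), 1):
        target = f"{profile.username}_{index}"
        loader.download_post(post, target=target)
        targets.append(target)
    return targets


# Usage with the real library (requires network access):
#   import instaloader
#   bot = instaloader.Instaloader()
#   profile = instaloader.Profile.from_username(bot.context, "embarcaderotech")
#   download_first(bot, profile, limit=5)
```

Because get_posts() returns a generator, islice avoids requesting more posts than are actually needed.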

In each folder, you will see the actual content of the posts of the profile like a video or images. The scripts above are taken and modified from this post.

Instaloader Python4Delphi results

There are a lot of results so it’s not very screenshot-friendly!

Instaloader Demo with Python4Delphi in Windows

Where do the screen-scraping results go?

All the contents will be retrieved automatically to the directory where you save or run the Python4Delphi Demo01 GUI.

You will get all the contents you need in no time (compared with the hard work of downloading them manually)!

Here is what the contents of each folder look like:

How do I Retrieve Twitter Data using Python Snscrape?

“Snscrape” is a library that allows anyone to scrape social networking services (SNS) without requiring personal API keys. It can return thousands of user profiles, hashtags, contents, or searches in seconds and has powerful and highly customizable tools.

The following services are currently supported:

Facebook: User profiles, groups, and communities (aka visitor posts)
Instagram: User profiles, hashtags, and locations
Reddit: Users, subreddits, and searches (via Pushshift)
Telegram: Channels. Learn why you should use Telegram Messenger in your own applications for added security in this article.
Twitter: Users, user profiles, hashtags, searches, threads, and list posts
VKontakte: User profiles
Weibo (Sina Weibo): User profiles

In this tutorial, we will only focus on using Python Snscrape for Twitter.

How do I get the Python Snscrape library?

pip install snscrape

Run the following command in cmd to get all tweets by Embarcadero Technologies (@EmbarcaderoTech):

snscrape twitter-user EmbarcaderoTech > twitter-@EmbarcaderoTech.txt

The scraping results are stored in the twitter-@EmbarcaderoTech.txt file.
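Each line of that file is expected to be a tweet URL whose last path segment is the tweet ID. A minimal sketch of pulling the IDs out with plain Python follows; the sample URLs are made up, and the one-URL-per-line format is an assumption worth checking against your own output file.

```python
def tweet_ids_from_lines(lines):
    """Extract the numeric ID that ends each tweet URL, skipping blank lines."""
    ids = []
    for line in lines:
        url = line.strip()
        if not url:
            continue
        tweet_id = url.rstrip("/").split("/")[-1]
        # Ignore anything that does not end in a numeric ID
        if tweet_id.isdigit():
            ids.append(int(tweet_id))
    return ids


# Made-up lines in the assumed one-URL-per-line format
sample_lines = [
    "https://twitter.com/EmbarcaderoTech/status/1234567890123456789\n",
    "\n",
    "https://twitter.com/EmbarcaderoTech/status/1234567890123456790\n",
]
print(tweet_ids_from_lines(sample_lines))
```

With a real file, passing the open file object works the same way, since iterating over it yields one line at a time.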

How do I Retrieve Twitter Data using Python Tweepy?

“Tweepy” is an easy-to-use Python library for accessing the Twitter API.

There are limitations in using Tweepy for scraping tweets. The standard API only allows you to retrieve tweets up to 7 days old and is limited to 18,000 tweets per 15-minute window. However, combining Tweepy with Snscrape lets you work around these limitations: you can retrieve all the tweets you want, as long as their URLs have already been scraped and stored in .txt files, as shown in the previous section.

Getting started with Python Tweepy

To get started with Tweepy you’ll need to do the following things:

Set up a Twitter account if you don’t have one already.
Using your Twitter account, you will need to apply for Developer Access and then create an application that will generate the API credentials that you will use to access Twitter from Python.
Install and import the Tweepy package.

Once you’ve done these things, you are ready to begin querying Twitter’s API to see what you can learn about tweets!

Run this pip command to install Tweepy:

pip install tweepy

Example Python code showing how to retrieve Twitter tweets

The following code uses Tweepy to retrieve all the @EmbarcaderoTech tweets listed in the previous section (run this inside the lower Memo of Python4Delphi Demo01 GUI):

# Import libraries
import os
import pandas as pd, tweepy

# Key & access tokens
consumer_key = "YOUR CONSUMER KEY"
consumer_secret = "YOUR CONSUMER SECRET"
access_token = "YOUR ACCESS TOKEN"
access_token_secret = "YOUR ACCESS TOKEN SECRET"

auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_token_secret)
api = tweepy.API(auth)

# Open your text file/snscrape output (one tweet URL per line)
tweet_url = pd.read_csv("twitter-@EmbarcaderoTech.txt", index_col=None, header=None, names=["links"])
print(tweet_url.head())

# Extract the tweet ID, which is the last path segment of each URL
tweet_url['id'] = tweet_url['links'].str.split("/").str[-1]
print(tweet_url.head())

# Convert the id column into a list
ids = tweet_url['id'].tolist()

# Process the ids in batches of 50
total_count = len(ids)
chunks = (total_count - 1) // 50 + 1

output_file = "embarcaderoTech_Tweets.csv"

# Look up one batch of tweets and append the fields we need to the CSV file
def fetch_tw(ids):
    # statuses_lookup is the Tweepy 3.x name; Tweepy 4.x renamed it to lookup_statuses
    list_of_tw_status = api.statuses_lookup(ids, tweet_mode="extended")
    rows = []
    for status in list_of_tw_status:
        rows.append({"tweet_id": status.id,
                     "screen_name": status.user.screen_name,
                     "Tweet": status.full_text,
                     "Date": status.created_at,
                     "retweet_count": status.retweet_count,
                     "favorite_count": status.favorite_count})
    if not rows:
        return
    # DataFrame.append was removed in pandas 2.0, so build the frame from a list of dicts
    data = pd.DataFrame(rows)
    # Write the header only when the file is first created
    data.to_csv(output_file, mode="a", header=not os.path.exists(output_file))

# Loop over the batches, processing 50 entries per iteration
for i in range(chunks):
    batch = ids[i*50:(i+1)*50]
    fetch_tw(batch)
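The batch arithmetic in the script above can be checked in isolation. This sketch uses placeholder integer IDs and a hypothetical batches helper to show how 120 items split into chunks of 50.

```python
def batches(items, size):
    """Yield consecutive slices of `items`, each at most `size` long."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


ids = list(range(1, 121))  # 120 placeholder IDs
chunk_count = (len(ids) - 1) // 50 + 1  # ceiling division, as in the script above

sizes = [len(batch) for batch in batches(ids, 50)]
print(chunk_count, sizes)
```

The last batch is simply shorter than the others, which is why slicing past the end of the list is safe here.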

Using Python and Tweepy for powerful Twitter scraping

Tweepy Demo with Python4Delphi in Windows

Tweepy Twitter scraping results in an Excel spreadsheet

We successfully scraped all @EmbarcaderoTech tweets, from 2009 to the present, and stored them in the “embarcaderoTech_Tweets.csv” file.

How do I Pull RSS Feed Data using Feedparser?

“Feedparser” or Universal Feed Parser is a library to parse Atom and RSS feeds in Python. feedparser can handle RSS 0.90, Netscape RSS 0.91, Userland RSS 0.91, RSS 0.92, RSS 0.93, RSS 0.94, RSS 1.0, RSS 2.0, Atom 0.3, Atom 1.0, and CDF feeds. It also parses several popular extension modules, including Dublin Core and Apple’s iTunes extensions.

To use feedparser, you will need Python 3.6 or later. feedparser is not meant to run standalone; it is a module for you to use as part of a larger Python program.

feedparser is easy to use; the module is self-contained in a single file, feedparser.py, and it has only one primary public function, parse. parse takes a number of arguments, but only one is required, and it can be a URL, a local filename, or a raw string containing feed data in any format.

Here is how you can get feedparser

pip install feedparser

Run the following script to parse data from the Stack Overflow RSS feed:

import feedparser

d = feedparser.parse('http://stackoverflow.com/feeds')

print(d.feed.title)
print(d.feed.title_detail)
print(d.feed.link)
print(d.entries)

Feedparser results in the Python4Delphi GUI

It’s a lot of data, again!

Feedparser Demo with Python4Delphi in Windows

Want to know some more? Then check out Python4Delphi which easily allows you to build Python GUIs for Windows using Delphi.
