Quickly Parse HTML And XML With BeautifulSoup Python Library In Delphi And C++ Windows Apps


We know how to load and display web content or local files in Delphi using TWebBrowser. It supports the basic functions of a browser, such as navigating to a URL and going back or forward, along with specific events. But what about web scraping in Delphi using the Python BeautifulSoup library? With the help of Python4Delphi, we can scrape web pages quickly from a Delphi or C++Builder app. This post walks through a sample Python script that shows how.

Delphi itself has extensive XML and HTML parsing capabilities through TXmlDocument, and here is some sample code for using TXmlDocument in Delphi. If you have an existing Python application, though, you could use the BeautifulSoup Python library to parse XML and HTML in your Python code. If you need extra speed, you could bring the XML or HTML data over to Delphi for faster parsing through Python4Delphi. You can use Python4Delphi in a number of different ways, such as:

Create a Windows Python GUI around your existing Python app.
Add Python scripting to your Delphi Windows apps.
Add parallel processing to your Python apps through Delphi threads.
Enhance your speed-sensitive Python apps with faster functions written in Delphi.

Prerequisites.

If Python and Python4Delphi are not installed on your machine, check this post on how to run a simple Python script in a Delphi application using the Python4Delphi sample app.
Open a Windows command prompt and type pip install -U bs4 to install BeautifulSoup4. For more information on installing Python modules, check here.
First, run the Demo1 project, which executes Python scripts in Python for Delphi. Then load the script into the Memo1 field and press the Execute Script button to see the result. Go to GitHub to download the Demo1 source.

procedure TForm1.Button1Click(Sender: TObject);
begin
 PythonEngine1.ExecStrings( Memo1.Lines );
end;
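
Before loading the full BeautifulSoup script, it can help to confirm that the embedded interpreter runs and can see the bs4 package. The following is a minimal sketch of a script you could paste into Memo1; the helper name is_module_installed is hypothetical and not part of Python4Delphi or Demo1, and it relies only on the Python standard library.

```python
import importlib.util
import sys


def is_module_installed(module_name):
    """Return True if the import system can locate the named top-level module."""
    return importlib.util.find_spec(module_name) is not None


# Report which Python version the host application has loaded.
version_text = "Python {}.{}.{}".format(*sys.version_info[:3])
print(version_text)

# "bs4" is the import name used by the sample script in this post.
bs4_available = is_module_installed("bs4")
print("bs4 importable:", bs4_available)
```

If the second line reports False, the pip install step from the prerequisites probably targeted a different Python installation than the one the application loads.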


Beautiful Soup Python Library sample script details: Beautiful Soup works with your favorite parser to provide idiomatic ways of navigating, searching, and modifying the parse tree. It commonly saves programmers hours or days of work. The sample script demonstrates:

How to transform a complex HTML document into a complex tree of Python objects (four kinds of objects: Tag, NavigableString, BeautifulSoup, and Comment).
How to navigate within the tree of Python objects: going down, up, sideways, and back and forth, as well as navigating by tag names.
How to search the parse tree using the two most popular methods: find() and find_all().
How to modify the tree and write your changes as a new HTML or XML document.

from bs4 import BeautifulSoup
html_doc = """<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b id = "boldest"> The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""
#Simple Html parsing.
soup = BeautifulSoup(html_doc,'html.parser')
print(soup.title)
print(soup.title.name)
print(soup.title.parent.name)
print(soup.p)
print(soup.p['class'])
print(soup.a)
# --Kinds of objects.---
tag = soup.b
print(type(tag))
# tag name
print(tag.name)
#tag id
print(tag['id'])

# A NavigableString corresponds to a bit of text within a tag.
souptag = BeautifulSoup('<b class="boldest">Extremely bold</b>', 'html.parser')
tag1 = souptag.b
print(tag1.string)
# Inspect the string of the tag parsed just above (tag1), not the earlier tag.
print(type(tag1.string))
#comments
markup = "<b><!--Hey, buddy. Want to buy a used parser?--></b>"
soup1 = BeautifulSoup(markup, 'html.parser')
comment = soup1.b.string
print(type(comment))
#Navigating using tagnames
print(soup.head)
print(soup.title)
# going Up
title_tag = soup.title
print(title_tag)
print(title_tag.parent)

# Search the tree
#find by id
print(soup.find(id="link3"))
# find all with <a> tags
for tag in soup.find_all('a'):
    print(tag)
#Modifying the tree
soup = BeautifulSoup('<b class="boldest">Extremely bold</b>', 'html.parser')
tag = soup.b

tag.name = "blockquote"
tag['class'] = 'verybold'
tag['id'] = 1
print(tag)
del tag['class']
del tag['id']
print(tag)
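
The script above navigates by tag name and goes up with .parent. Going down and sideways could look roughly like the sketch below. It reuses a trimmed copy of the same sample document so that it runs on its own; the variable names are illustrative only.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

html_doc = """<html><head><title>The Dormouse's story</title></head>
<body>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
</body></html>
"""
soup = BeautifulSoup(html_doc, "html.parser")

# Going down: .contents is the list of a tag's direct children.
head_children = soup.head.contents
print(head_children)

# Going down: .children iterates over the same direct children.
# Text nodes sit between the links, so keep only Tag objects.
child_tags = [child for child in soup.p.children if isinstance(child, Tag)]
print([child["id"] for child in child_tags])

# Going sideways: .next_sibling is the very next node, here the text after the link.
first_link = soup.find("a", id="link1")
print(repr(first_link.next_sibling))

# find_next_sibling() and find_previous_sibling() skip to the next matching tag.
second_link = first_link.find_next_sibling("a")
print(second_link["id"])
third_link = soup.find("a", id="link3")
print(third_link.find_previous_sibling("a")["id"])
```

Note that .next_sibling often returns whitespace or punctuation rather than a tag, which is why the find_next_sibling() form tends to be more convenient when you want the neighboring element.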


Figure: BeautifulSoup Python Library Demo

Beautiful Soup also provides a select() method, which runs a CSS selector against a parsed document and returns all the matching elements. Tag has a similar method, which runs a CSS selector against the contents of a single tag. Check here for more details.
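
As a rough illustration of CSS selectors, the sketch below runs select() on the whole document, select_one() for a single match, and select() on one tag. The markup is a shortened version of the sample document, and the selectors are just examples of what is possible.

```python
from bs4 import BeautifulSoup

html_doc = """<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
</p>"""
soup = BeautifulSoup(html_doc, "html.parser")

# Run a selector against the whole document: every a.sister directly inside p.story.
sister_links = soup.select("p.story > a.sister")
sister_names = [link.get_text() for link in sister_links]
print(sister_names)

# select_one() returns only the first match (or None if nothing matches).
link2 = soup.select_one("#link2")
print(link2["href"])

# Run a selector against the contents of a single tag.
story = soup.p
tillie_links = story.select("a[href$='tillie']")
print(tillie_links)
```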
You can do much more with this library, such as outputting the Beautiful Soup parse tree as a nicely formatted Unicode string with a separate line for each tag and each string, comparing objects for equality, and copying Beautiful Soup objects.
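
A short sketch of those three features follows, assuming a small made-up snippet of markup. It also shows str(), which is one simple way to write the tree back out as compact HTML.

```python
import copy

from bs4 import BeautifulSoup

markup = "<p>I want <b>pizza</b> and more <b>pizza</b>!</p>"
soup = BeautifulSoup(markup, "html.parser")

# Formatted output: prettify() puts each tag and each string on its own line.
pretty = soup.prettify()
print(pretty)

# Compact output: str() serializes the tree without extra whitespace.
compact = str(soup)
print(compact)

# Equality: two tags are equal when they represent the same markup,
# even though they are different objects in the tree.
first_b, second_b = soup.find_all("b")
print(first_b == second_b)
print(first_b is second_b)

# Copying: copy.copy() gives an equal tag that is detached from the tree.
p_copy = copy.copy(soup.p)
print(p_copy == soup.p)
print(p_copy.parent)
```

Because a copy is detached, it has no parent, so it can be modified or inserted elsewhere without affecting the original tree.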

Note: The samples used for this demonstration were picked from here; the only difference is that the outputs are printed. You can check the APIs and more samples in the same place.

You have now read a quick overview of the Beautiful Soup library. Download the library from here and easily pull data out of HTML and XML in your applications. Check out Python4Delphi and easily build Python GUIs for Windows using Delphi.
