We know how to load and display web content or local files in Delphi using TWebBrowser. It supports the basic functions of a browser, such as navigating to a URL, going back, and going forward, along with specific events. But how about web scraping in Delphi using the Python BeautifulSoup library? Sounds interesting? With the help of Python4Delphi, we can scrape web pages quickly in a Delphi/C++Builder app. This post walks through a sample Python script.
Delphi itself has extensive XML and HTML parsing capabilities through TXmlDocument. And here is some sample code for utilizing TXmlDocument in Delphi. If you have an existing Python application, though, you can use the BeautifulSoup Python library to parse XML and HTML in your Python code. If you need extra speed, you could bring the XML or HTML data over to Delphi for faster parsing through Python4Delphi. You can use Python4Delphi in a number of different ways, such as:
- Create a Windows Python GUI around your existing Python app.
- Add Python scripting to your Delphi Windows apps.
- Add parallel processing to your Python apps through Delphi threads.
- Speed up your speed-sensitive Python apps with functions written in Delphi.
Prerequisites.
- If Python and Python4Delphi are not installed on your machine, check this post on how to run a simple Python script in a Delphi application using the Python4Delphi sample app.
- Open a Windows command prompt and type pip install -U bs4 to install BeautifulSoup4. (On PyPI, bs4 is a thin wrapper package that pulls in beautifulsoup4; pip install -U beautifulsoup4 installs the same library directly.) For more information on installing Python modules, check here.
- First, run the Demo1 project, which executes Python scripts in Python for Delphi. Then load the script in the Memo1 field and press the Execute Script button to see the result. Go to GitHub to download the Demo1 source.
```delphi
procedure TForm1.Button1Click(Sender: TObject);
begin
  PythonEngine1.ExecStrings( Memo1.Lines );
end;
```
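Before loading the Beautiful Soup sample, it can help to confirm that the Python interpreter loaded by Python4Delphi can actually see the bs4 package, since pip may have installed it into a different Python installation than the one the demo uses. The following is a minimal sketch to paste into Memo1; the helper name module_available is made up for illustration and is not part of Python4Delphi or Beautiful Soup.

```python
# Sanity check to run from Memo1 before the Beautiful Soup sample.
# module_available is a hypothetical helper written for this post.
import importlib.util
import sys


def module_available(name):
    """Return True if this interpreter can locate the named top-level module."""
    return importlib.util.find_spec(name) is not None


# Show which Python version the embedded engine is running.
print("Python version:", sys.version.split()[0])

# Report whether Beautiful Soup can be imported by this interpreter.
print("bs4 available:", module_available("bs4"))
```

If the last line prints False, the package was most likely installed for another Python installation, and running pip from the interpreter that the demo loads should fix it.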
Beautiful Soup Python library sample script details: Beautiful Soup works with your favorite parser to provide idiomatic ways of navigating, searching, and modifying the parse tree. It commonly saves programmers hours or days of work. The sample script demonstrates:
- How to transform a complex HTML document into a complex tree of Python objects (four kinds of objects: Tag, NavigableString, BeautifulSoup, and Comment).
- How to navigate within the tree of Python objects: going down, up, sideways, and back and forth, and navigating by tag names.
- How to search the parse tree using the two most popular methods: find() and find_all().
- How to modify the tree and write your changes as a new HTML or XML document.
```python
from bs4 import BeautifulSoup

html_doc = """<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b id = "boldest"> The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

# Simple HTML parsing.
soup = BeautifulSoup(html_doc, 'html.parser')
print(soup.title)
print(soup.title.name)
print(soup.title.parent.name)
print(soup.p)
print(soup.p['class'])
print(soup.a)

# --- Kinds of objects ---
tag = soup.b
print(type(tag))
# Tag name
print(tag.name)
# Tag id
print(tag['id'])

# A NavigableString corresponds to a bit of text within a tag.
souptag = BeautifulSoup('<b class="boldest">Extremely bold</b>', 'html.parser')
tag1 = souptag.b
print(tag1.string)
print(type(tag.string))

# Comments
markup = "<b><!--Hey, buddy. Want to buy a used parser?--></b>"
soup1 = BeautifulSoup(markup, 'html.parser')
comment = soup1.b.string
print(type(comment))

# Navigating using tag names
print(soup.head)
print(soup.title)

# Going up
title_tag = soup.title
print(title_tag)
print(title_tag.parent)

# Search the tree
# Find by id
print(soup.find(id="link3"))
# Find all <a> tags
for tag in soup.find_all('a'):
    print(tag)

# Modifying the tree
soup = BeautifulSoup('<b class="boldest">Extremely bold</b>', 'html.parser')
tag = soup.b
tag.name = "blockquote"
tag['class'] = 'verybold'
tag['id'] = 1
print(tag)
del tag['class']
del tag['id']
print(tag)
```
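The sample script shows going up and navigating by tag name, but not the sideways or back-and-forth navigation mentioned earlier. A minimal sketch of those, using a small made-up snippet of markup rather than the html_doc from the script, could look like this:

```python
from bs4 import BeautifulSoup

# Compact, made-up markup with no whitespace between tags, so that
# siblings are tags rather than whitespace-only strings.
sibling_soup = BeautifulSoup("<a><b>text1</b><c>text2</c></a>", "html.parser")

b_tag = sibling_soup.b
c_tag = sibling_soup.c

# Going sideways: siblings share the same parent.
print(b_tag.next_sibling)        # <c>text2</c>
print(c_tag.previous_sibling)    # <b>text1</b>
print(b_tag.previous_sibling)    # None, because <b> is the first child

# Going back and forth: next_element follows parse order, so it steps
# into the tag's own text before reaching the next sibling.
print(b_tag.next_element)        # text1

# Going down: iterate over a tag's direct children.
child_names = [child.name for child in sibling_soup.a.children]
print(child_names)               # ['b', 'c']
```

With pretty-printed HTML such as html_doc, the sibling of a tag is often a string holding a newline or a comma, which is why this sketch uses compact markup.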
- Beautiful Soup can run a CSS selector against a parsed document and return all the matching elements through its select() method. Tag has a similar method, which runs a CSS selector against the contents of a single tag. Check here for more details.
- You can do much more with this library, such as outputting the Beautiful Soup parse tree as a nicely formatted Unicode string with a separate line for each tag and each string, comparing objects for equality, and copying Beautiful Soup objects.
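As a rough illustration of these extra features, the sketch below runs CSS selectors, pretty-prints the tree, and copies and compares a tag. The markup is a shortened, made-up variant of the three sisters example, and the variable names are only for illustration.

```python
import copy

from bs4 import BeautifulSoup

# Shortened, made-up variant of the sample document.
html = (
    '<p class="story">'
    '<a class="sister" id="link1">Elsie</a>'
    '<a class="sister" id="link2">Lacie</a>'
    '</p>'
)
soup = BeautifulSoup(html, "html.parser")

# CSS selectors: select() returns every match, select_one() the first one.
links = soup.select("p.story > a.sister")
print([a["id"] for a in links])            # ['link1', 'link2']
print(soup.select_one("#link2").string)    # Lacie

# Tag.select() restricts the search to the contents of a single tag.
print(len(soup.p.select("a")))             # 2

# prettify() returns the tree as an indented Unicode string.
print(soup.prettify())

# Copying and comparing: a copy is equal to the original (same markup)
# but is a separate object, detached from the original tree.
original = soup.a
duplicate = copy.copy(original)
print(duplicate == original)    # True
print(duplicate is original)    # False
print(duplicate.parent)         # None
```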
Note: The samples used for demonstration were taken from here; the only difference is that the outputs are printed. You can check the APIs and more samples in the same place.
You have now read a quick overview of the Beautiful Soup library. Download the library from here and pull data out of HTML and XML easily in your applications. Check out Python4Delphi and easily build Python GUIs for Windows using Delphi.