HTMLaaS | Simplifying HTML Data Extraction with FastAPI
Introduction
In the world of web development and data analysis, extracting specific information from HTML documents is a common task. Whether it's scraping data from websites or processing HTML files, developers often face challenges in efficiently retrieving desired data. To address this, I have developed HTMLaaS (HTML as a Service), a powerful tool that allows users to query and extract specific information from HTML documents seamlessly. Powered by FastAPI, HTMLaaS simplifies working with HTML, providing users with an intuitive interface to retrieve page titles, extract links, access the entire HTML content, and query elements using tag names, classes, and IDs.
Where can HTMLaaS be used?
Web Scraping: HTMLaaS enables developers to extract desired data from websites without having to write complex scraping scripts. By leveraging the provided API endpoints, users can easily retrieve specific elements, such as article titles, prices, or product descriptions.
Data Analysis: Researchers and analysts often need to process HTML documents to extract valuable insights. HTMLaaS simplifies this process by allowing users to query and extract specific elements, facilitating data extraction and analysis.
Content Extraction: Content management systems, news aggregators, and online publishers can leverage HTMLaaS to efficiently extract relevant information from HTML documents. This enables them to automate content processing and streamline their operations.
Source code - Here
Code Walkthrough
```python
from fastapi import FastAPI
from pydantic import BaseModel
import requests
from bs4 import BeautifulSoup

app = FastAPI()
```
The code imports the necessary dependencies: FastAPI for building the API, `BaseModel` from Pydantic for creating request/response models, `requests` for making HTTP requests, and `BeautifulSoup` from the `bs4` library for parsing HTML content. The `app` variable is initialized as a FastAPI application.
```python
class Url(BaseModel):
    url: str

class Element(BaseModel):
    url: str
    property: str
```
Two Pydantic models, `Url` and `Element`, are defined as request models. They describe the structure of the incoming JSON payloads for the corresponding endpoints.
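As a quick sanity check of what these models buy us: Pydantic validates incoming payloads automatically, and FastAPI turns a validation failure into a 422 response for the client. A minimal illustration (standalone, outside the app):

```python
from pydantic import BaseModel, ValidationError

class Element(BaseModel):
    url: str
    property: str

# a valid payload populates the model fields
element = Element(url="https://example.com", property="h2")

# a payload missing a required field raises ValidationError,
# which FastAPI converts into a 422 error response
try:
    Element(url="https://example.com")
    missing_rejected = False
except ValidationError:
    missing_rejected = True
```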
```python
def parse_url(url):
    r = requests.get(url)
    soup = BeautifulSoup(r.text, "html.parser")
    return soup
```
The `parse_url` function takes a URL as input, sends a GET request to retrieve the HTML content, and parses it using BeautifulSoup. It returns the parsed HTML object.
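Note that this helper has no timeout and ignores HTTP error statuses, so a slow host can hang the endpoint and a 404 page will be parsed like a normal one. A possible hardened sketch (the 10-second timeout is an arbitrary choice, not from the original code):

```python
import requests
from bs4 import BeautifulSoup

def parse_url(url: str) -> BeautifulSoup:
    # a timeout keeps a slow or unresponsive host from hanging the endpoint
    r = requests.get(url, timeout=10)
    # surface 4xx/5xx responses instead of silently parsing an error page
    r.raise_for_status()
    return BeautifulSoup(r.text, "html.parser")
```

Invalid input then fails fast with a `requests` exception instead of producing a misleading parse.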
```python
@app.get("/")
async def root():
    return {"status": "200", "message": "hello from fastapi"}
```
This is the root endpoint (`/`) defined as a GET request. It simply returns a JSON response with a status code and a message.
```python
@app.post("/")
async def root(url: Url):
    soup = parse_url(url.url)
    return {"page": str(soup)}
```
This endpoint (`/`) is defined as a POST request. It takes a JSON payload containing the URL in the `url` field of the `Url` model. The `parse_url` function is called to retrieve and parse the HTML content, and the parsed HTML object is returned as a string in the JSON response.
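To see what the `page` field actually contains, here is the same `str(soup)` serialization step applied to an inline HTML snippet (no network involved):

```python
from bs4 import BeautifulSoup

html = "<html><body><p>hi</p></body></html>"
soup = BeautifulSoup(html, "html.parser")
# the "page" field of the JSON response holds this serialized form
page = str(soup)
```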
```python
@app.post("/title")
async def root(url: Url):
    soup = parse_url(url.url)
    return {"title": soup.title.string}
```
This endpoint (`/title`) is defined as a POST request. It takes a JSON payload containing the URL in the `url` field of the `Url` model. The `parse_url` function is called to retrieve and parse the HTML content, and the title of the HTML page is extracted using `soup.title.string` and returned in the JSON response.
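One caveat: `soup.title` is `None` for pages without a `<title>` tag, so `soup.title.string` raises an `AttributeError` there. A defensive variant could look like this (`safe_title` is a hypothetical helper, not part of the original code):

```python
from bs4 import BeautifulSoup

def safe_title(soup):
    # soup.title is None when the page has no <title>, so guard before .string
    return soup.title.string if soup.title else None

with_title = BeautifulSoup("<title>Demo</title>", "html.parser")
no_title = BeautifulSoup("<p>no head</p>", "html.parser")
```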
```python
@app.post("/element")
async def root(url: Element):
    soup = parse_url(url.url)
    elements = []
    for element in soup.find_all(url.property):
        elements.append(str(element))
    return {"elements": elements}
```
This endpoint (`/element`) is defined as a POST request. It takes a JSON payload containing the URL in the `url` field and the desired HTML tag name in the `property` field of the `Element` model. Note that `soup.find_all(url.property)` matches tag names only; classes and IDs are handled by the dedicated `/class` and `/id` endpoints below. The `parse_url` function is called to retrieve and parse the HTML content, the matching elements are converted to strings, and the extracted elements are returned in the JSON response.
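To illustrate what this endpoint returns for a given tag name, the same `find_all` call can be run on an inline snippet (no network involved):

```python
from bs4 import BeautifulSoup

html = "<div><h2>First</h2><h2>Second</h2><p>text</p></div>"
soup = BeautifulSoup(html, "html.parser")
# find_all with a string argument matches tag names
elements = [str(el) for el in soup.find_all("h2")]
```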
```python
@app.post("/links")
async def root(url: Url):
    soup = parse_url(url.url)
    links = []
    for link in soup.find_all("a"):
        links.append(link.get("href"))
    return {"links": links}
```
This endpoint (`/links`) is defined as a POST request. It takes a JSON payload containing the URL in the `url` field of the `Url` model. The `parse_url` function is called to retrieve and parse the HTML content. All the links (`<a>` tags) in the HTML page are collected using `soup.find_all("a")`, the `href` attribute of each link is retrieved with `link.get("href")`, and the extracted links are returned in the JSON response.
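Be aware that `href` values are often relative (e.g. `/about`). A possible extension, not part of the original endpoint, is to resolve them against the page URL with `urllib.parse.urljoin`:

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup

html = '<a href="/about">About</a><a href="https://other.example/page">Ext</a>'
soup = BeautifulSoup(html, "html.parser")

base = "https://example.com"
# urljoin resolves relative hrefs against the base and leaves absolute ones intact
links = [urljoin(base, a.get("href")) for a in soup.find_all("a")]
```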
```python
@app.post("/class")
async def root(url: Element):
    soup = parse_url(url.url)
    elements = []
    for element in soup.find_all(class_=url.property):
        elements.append(str(element))
    return {"class": elements}
```
This endpoint (`/class`) is defined as a POST request. It takes a JSON payload containing the URL in the `url` field and the desired HTML class in the `property` field of the `Element` model. The `parse_url` function is called to retrieve and parse the HTML content. Elements with the specified class are extracted using `soup.find_all(class_=url.property)`, and the extracted elements are returned in the JSON response.
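The `class_` keyword (with a trailing underscore, since `class` is a Python keyword) matches elements of any tag that carry the given class. A small illustration on inline HTML:

```python
from bs4 import BeautifulSoup

html = '<p class="note">a</p><p class="other">b</p><span class="note">c</span>'
soup = BeautifulSoup(html, "html.parser")
# class_= matches the class attribute regardless of tag name
notes = [el.get_text() for el in soup.find_all(class_="note")]
```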
```python
@app.post("/id")
async def root(url: Element):
    soup = parse_url(url.url)
    elements = []
    for element in soup.find_all(id=url.property):
        elements.append(str(element))
    return {"id": elements}
```
This endpoint (`/id`) is defined as a POST request. It takes a JSON payload containing the URL in the `url` field and the desired HTML ID in the `property` field of the `Element` model. The `parse_url` function is called to retrieve and parse the HTML content. Elements with the specified ID are extracted using `soup.find_all(id=url.property)`, and the extracted elements are returned in the JSON response.
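Since an ID should be unique within a document, `soup.find` (which returns only the first match, or `None`) is usually sufficient here; the endpoint's `find_all` simply returns a list of at most one well-formed match. A small illustration on inline HTML:

```python
from bs4 import BeautifulSoup

html = '<div id="main">content</div><div id="side">nav</div>'
soup = BeautifulSoup(html, "html.parser")
# IDs should be unique, so find() (first match) is enough
el = soup.find(id="main")
```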
Conclusion
In conclusion, HTMLaaS, powered by FastAPI, simplifies HTML data extraction behind a small, intuitive API, and as a self-hosted, open-source service it can stand in for paid scraping APIs.
Furthermore, I encourage you to use the provided code as a template and extend its capabilities to suit your specific use cases. Feel free to modify and enhance the existing endpoints or add new ones to meet your requirements.