HTMLaaS | Simplifying HTML Data Extraction with FastAPI


Introduction

In web development and data analysis, extracting specific information from HTML documents is a common task. Whether scraping data from websites or processing local HTML files, developers often struggle to retrieve the data they need efficiently. To address this, I have developed HTMLaaS (HTML as a Service), a tool for querying and extracting specific information from HTML documents. Built on FastAPI, HTMLaaS provides a simple API for retrieving page titles, extracting links, accessing the full HTML content, and querying elements by tag name, class, or ID.

Where can HTMLaaS be used?

  1. Web Scraping: HTMLaaS enables developers to extract desired data from websites without having to write complex scraping scripts. By leveraging the provided API endpoints, users can easily retrieve specific elements, such as article titles, prices, or product descriptions.

  2. Data Analysis: Researchers and analysts often need to process HTML documents to extract valuable insights. HTMLaaS simplifies this process by allowing users to query and extract specific elements, facilitating data extraction and analysis.

  3. Content Extraction: Content management systems, news aggregators, and online publishers can leverage HTMLaaS to efficiently extract relevant information from HTML documents. This enables them to automate content processing and streamline their operations.

Source code - Here

Code Walkthrough

from fastapi import FastAPI
from pydantic import BaseModel
import requests
from bs4 import BeautifulSoup

app = FastAPI()
  • The code imports the necessary dependencies: FastAPI for building the API, BaseModel from pydantic for creating request/response models, requests for making HTTP requests, and BeautifulSoup from the bs4 library for parsing HTML content.

  • The app variable is initialized as a FastAPI application.

class Url(BaseModel):
    url: str


class Element(BaseModel):
    url: str
    property: str
  • Two Pydantic models, Url and Element, are defined as request models. These models define the structure of the incoming JSON data for the corresponding endpoints.
def parse_url(url):
    r = requests.get(url)
    soup = BeautifulSoup(r.text, "html.parser")
    return soup
  • The parse_url function takes a URL as an input, sends a GET request to retrieve the HTML content, and parses it using BeautifulSoup. It returns the parsed HTML object.
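In production you would likely want a timeout and an explicit error check around the GET request, so a slow or failing site does not hang the API or get parsed as if it were a valid page. A hardened variant might look like this (the name fetch_soup and the 10-second default are my own choices, not part of the original code):

```python
import requests
from bs4 import BeautifulSoup


def fetch_soup(url, timeout=10):
    """Fetch a URL and return parsed HTML, raising on HTTP errors."""
    r = requests.get(url, timeout=timeout)
    r.raise_for_status()  # surface 4xx/5xx instead of parsing an error page
    return BeautifulSoup(r.text, "html.parser")
```

With raise_for_status, a 404 or 500 propagates as an exception that FastAPI reports, rather than silently returning the HTML of an error page.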
@app.get("/")
async def root():
    return {"status": "200", "message": "hello from fastapi"}
  • This is the root endpoint ("/") defined as a GET request. It simply returns a JSON response with a status code and a message.
@app.post("/")
async def get_page(url: Url):
    soup = parse_url(url.url)
    return {"page": str(soup)}
  • This endpoint ("/") is defined as a POST request. It takes a JSON payload containing the URL in the url field of the Url model.

  • The parse_url function is called to retrieve and parse the HTML content.

  • The parsed HTML object is returned as a string in the JSON response.

@app.post("/title")
async def get_title(url: Url):
    soup = parse_url(url.url)
    return {"title": soup.title.string}
  • This endpoint ("/title") is defined as a POST request. It takes a JSON payload containing the URL in the url field of the Url model.

  • The parse_url function is called to retrieve and parse the HTML content.

  • The title of the HTML page is extracted using soup.title.string and returned in the JSON response.
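One caveat: soup.title is None when the page has no <title> element, so soup.title.string would raise an AttributeError on such pages. A defensive check is straightforward; here it is demonstrated on local HTML strings (the helper name safe_title is hypothetical, used only for illustration):

```python
from bs4 import BeautifulSoup

html_with_title = "<html><head><title>Example</title></head><body></body></html>"
html_without_title = "<html><body><p>no head</p></body></html>"


def safe_title(html):
    # soup.title is None when no <title> tag exists, so guard before .string
    soup = BeautifulSoup(html, "html.parser")
    return soup.title.string if soup.title else None


print(safe_title(html_with_title))     # → Example
print(safe_title(html_without_title))  # → None
```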

@app.post("/element")
async def get_elements(url: Element):
    soup = parse_url(url.url)
    elements = []
    for element in soup.find_all(url.property):
        elements.append(str(element))
    return {"elements": elements}
  • This endpoint ("/element") is defined as a POST request. It takes a JSON payload containing the URL in the url field and the tag name to search for (e.g. p, div, or h2) in the property field of the Element model. Searching by class or ID is handled by the separate "/class" and "/id" endpoints below.

  • The parse_url function is called to retrieve and parse the HTML content.

  • Using soup.find_all(url.property), the desired elements are extracted from the HTML page and converted to strings.

  • The extracted elements are returned in the JSON response.
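The extraction step can be seen in isolation with a small local HTML snippet, no running server required (the sample HTML below is mine, chosen only to illustrate the query):

```python
from bs4 import BeautifulSoup

html = """
<article>
  <h2>First post</h2>
  <p>Intro text</p>
  <h2>Second post</h2>
</article>
"""

soup = BeautifulSoup(html, "html.parser")
# find_all with a bare string matches by tag name, in document order
elements = [str(el) for el in soup.find_all("h2")]
print(elements)  # → ['<h2>First post</h2>', '<h2>Second post</h2>']
```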

@app.post("/links")
async def get_links(url: Url):
    soup = parse_url(url.url)
    links = []
    for link in soup.find_all("a"):
        links.append(link.get("href"))
    return {"links": links}
  • This endpoint ("/links") is defined as a POST request. It takes a JSON payload containing the URL in the url field of the Url model.

  • The parse_url function is called to retrieve and parse the HTML content.

  • All the links (<a> tags) in the HTML page are extracted using soup.find_all("a").

  • The href attribute of each link is retrieved using link.get("href").

  • The extracted links are returned in the JSON response.
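Note that href values come back exactly as written in the page, so relative links stay relative. If absolute URLs are needed, the standard library's urllib.parse.urljoin can resolve them against the page URL; this is a sketch of the idea, not part of the endpoint above:

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

base = "https://example.com/blog/"
html = '<a href="/about">About</a> <a href="post-1">Post</a> <a href="https://other.com">Other</a>'

soup = BeautifulSoup(html, "html.parser")
# urljoin resolves relative hrefs against the base and leaves absolute ones alone
links = [urljoin(base, a.get("href")) for a in soup.find_all("a")]
print(links)
# → ['https://example.com/about', 'https://example.com/blog/post-1', 'https://other.com']
```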

@app.post("/class")
async def get_by_class(url: Element):
    soup = parse_url(url.url)
    elements = []
    for element in soup.find_all(class_=url.property):
        elements.append(str(element))
    return {"class": elements}
  • This endpoint ("/class") is defined as a POST request. It takes a JSON payload containing the URL in the url field and the desired HTML class in the property field of the Element model.

  • The parse_url function is called to retrieve and parse the HTML content.

  • Elements with the specified class (url.property) are extracted from the HTML page using soup.find_all(class_=url.property).

  • The extracted elements are returned in the JSON response.
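It is worth noting that find_all(class_=...) matches any tag carrying that class, regardless of tag name, and is equivalent to a CSS class selector via soup.select. A quick local check (sample HTML is my own):

```python
from bs4 import BeautifulSoup

html = '<div class="card">A</div><span class="card">B</span><div class="other">C</div>'
soup = BeautifulSoup(html, "html.parser")

# the class_ keyword and the ".card" CSS selector return the same elements
by_kwarg = [str(el) for el in soup.find_all(class_="card")]
by_selector = [str(el) for el in soup.select(".card")]
assert by_kwarg == by_selector
print(by_kwarg)  # → ['<div class="card">A</div>', '<span class="card">B</span>']
```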

@app.post("/id")
async def get_by_id(url: Element):
    soup = parse_url(url.url)
    elements = []
    for element in soup.find_all(id=url.property):
        elements.append(str(element))
    return {"id": elements}
  • This endpoint ("/id") is defined as a POST request. It takes a JSON payload containing the URL in the url field and the desired HTML ID in the property field of the Element model.

  • The parse_url function is called to retrieve and parse the HTML content.

  • Elements with the specified ID (url.property) are extracted from the HTML page using soup.find_all(id=url.property).

  • The extracted elements are returned in the JSON response.
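Since IDs are meant to be unique within a page, find_all usually returns a single-element list here; soup.find(id=...) would be an equally valid choice, returning the element directly (or None). A small local demonstration with sample HTML of my own:

```python
from bs4 import BeautifulSoup

html = '<div id="main">Hello</div><p id="footer">Bye</p>'
soup = BeautifulSoup(html, "html.parser")

# find_all returns a (usually single-element) list; find returns the element or None
matches = [str(el) for el in soup.find_all(id="main")]
element = soup.find(id="main")
print(matches)       # → ['<div id="main">Hello</div>']
print(str(element))  # → <div id="main">Hello</div>
```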

Conclusion

In conclusion, HTMLaaS, powered by FastAPI, simplifies HTML data extraction by wrapping common BeautifulSoup queries behind a small, intuitive API. Because it is self-hosted, it can also serve as a free alternative to paid extraction services.

Furthermore, I encourage you to use the provided code as a template and extend its capabilities to suit your specific use cases. Feel free to modify and enhance the existing endpoints or add new ones to meet your requirements.
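As one example of such an extension, a hypothetical /text endpoint could return just the visible text of a page. The core of it is BeautifulSoup's get_text method, shown here on a local snippet (the endpoint name and sample HTML are my own suggestions, not part of the original project):

```python
from bs4 import BeautifulSoup

html = "<html><body><h1>Title</h1><p>Some <b>bold</b> text.</p></body></html>"
soup = BeautifulSoup(html, "html.parser")

# strip=True trims each text node; separator joins them with a single space
text = soup.get_text(separator=" ", strip=True)
print(text)  # → Title Some bold text.
```

Wiring this into HTMLaaS would follow the same pattern as the other endpoints: a POST route taking the Url model, a call to parse_url, and a JSON response containing the extracted text.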