emergingtrajectories.crawlers

Crawlers provide a standardized approach to interacting with web pages and extracting information. The module includes a crawler based on PhaseLLM (Python requests), one using Playwright (headless or with a visible browser window), and one using ScrapingBee, to enable flexible scraping.

All crawlers return the raw HTML content and the extracted text content, in that order.
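Because every crawler exposes the same get_content(url) method, calling code can be written against that shape rather than a specific class. The following is a minimal sketch of that idea, not code from the library: the Crawler protocol, FakeCrawler, and summarize are hypothetical names introduced here for illustration, and FakeCrawler returns canned content instead of touching the network.

```python
from __future__ import annotations

from typing import Protocol


class Crawler(Protocol):
    """Hypothetical protocol describing the shape shared by the crawlers."""

    def get_content(self, url: str) -> tuple[str, str]:
        ...


class FakeCrawler:
    """Hypothetical stand-in for a real crawler; returns canned content."""

    def get_content(self, url: str) -> tuple[str, str]:
        text = f"Content of {url}"
        raw_html = f"<html><body><p>{text}</p></body></html>"
        # Same order as the real crawlers: raw HTML first, extracted text second.
        return raw_html, text


def summarize(crawler: Crawler, url: str) -> dict:
    """Fetch a page with any crawler and report basic facts about it."""
    raw_html, text = crawler.get_content(url)
    return {"url": url, "html_chars": len(raw_html), "text": text}
```

Any of the crawler classes documented on this page should be usable in place of FakeCrawler, since they share the same method name and return order.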

Classes

crawlerPlaywright

Crawler that uses Playwright to scrape web pages.

crawlerPhaseLLM

PhaseLLM scraper. Uses Python requests and does not execute JS.

crawlerScrapingBee

Crawler that uses ScrapingBee to scrape web pages.

Functions

_bs4_childtraversal(→ str)

Recursively traverse the DOM to extract content.

_get_text_bs4(→ str)

Extract text content from HTML using BeautifulSoup.

Module Contents

emergingtrajectories.crawlers._bs4_childtraversal(html: str) → str

Recursively traverse the DOM to extract content.

Parameters:

html (str) – HTML content

Returns:

Extracted content

Return type:

str
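The library's implementation is not shown here, so the following is only a sketch of what recursive DOM traversal for text extraction can look like. It deliberately uses the standard library's html.parser instead of BeautifulSoup so that it runs without extra dependencies. All names (_TreeBuilder, traverse, extract_text) and the choice to skip script and style elements are assumptions made for illustration.

```python
from html.parser import HTMLParser

VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}  # elements with no closing tag
SKIPPED_TAGS = {"script", "style"}  # assumption: their contents are not page text


class _TreeBuilder(HTMLParser):
    """Builds a minimal tree where each node is {"tag": str, "children": list}."""

    def __init__(self):
        super().__init__()
        self.root = {"tag": "root", "children": []}
        self._stack = [self.root]

    def handle_starttag(self, tag, attrs):
        node = {"tag": tag, "children": []}
        self._stack[-1]["children"].append(node)
        if tag not in VOID_TAGS:
            self._stack.append(node)

    def handle_endtag(self, tag):
        # Pop only when the closing tag matches the innermost open element.
        if len(self._stack) > 1 and self._stack[-1]["tag"] == tag:
            self._stack.pop()

    def handle_data(self, data):
        if data.strip():
            self._stack[-1]["children"].append(data.strip())


def traverse(node) -> str:
    """Recursively collect text from a node and all of its descendants."""
    if isinstance(node, str):
        return node
    if node["tag"] in SKIPPED_TAGS:
        return ""
    parts = [traverse(child) for child in node["children"]]
    return " ".join(part for part in parts if part)


def extract_text(html: str) -> str:
    builder = _TreeBuilder()
    builder.feed(html)
    builder.close()
    return traverse(builder.root)
```

The recursion bottoms out at text nodes (plain strings) and joins the results of child nodes on the way back up, which is the general pattern a child-traversal helper follows.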

emergingtrajectories.crawlers._get_text_bs4(html: str) → str

Extract text content from HTML using BeautifulSoup.

Parameters:

html (str) – HTML content

Returns:

Extracted text content

Return type:

str

class emergingtrajectories.crawlers.crawlerPlaywright(headless: bool = True)

Crawler that uses Playwright to scrape web pages.

Parameters:

headless (bool, optional) – Run the browser in headless mode. Defaults to True.

headless = True
get_content(url: str) → tuple[str, str]

Gets content for a specific URL.

Parameters:

url (str) – URL to scrape

Returns:

Raw HTML content and extracted text content (in this order)

Return type:

tuple[str, str]
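A short usage sketch for this class, based only on the constructor and method signatures documented above. The wrapper name fetch_page is hypothetical, and the import is placed inside the function so that the surrounding module still loads where Playwright or emergingtrajectories is not installed.

```python
from __future__ import annotations


def fetch_page(url: str, headless: bool = True) -> tuple[str, str]:
    """Fetch a page with crawlerPlaywright and return (raw HTML, extracted text)."""
    # Lazy import: requires emergingtrajectories and Playwright at call time only.
    from emergingtrajectories.crawlers import crawlerPlaywright

    crawler = crawlerPlaywright(headless=headless)
    raw_html, text = crawler.get_content(url)
    return raw_html, text
```

Calling fetch_page("https://example.com", headless=False) would run the same request with a visible browser window, which can help when checking what a page actually renders.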

class emergingtrajectories.crawlers.crawlerPhaseLLM

PhaseLLM scraper. Uses Python requests and does not execute JS.

scraper
get_content(url)

Gets content for a specific URL.

Parameters:

url (str) – URL to scrape

Returns:

Raw HTML content and extracted text content (in this order)

Return type:

tuple[str, str]

class emergingtrajectories.crawlers.crawlerScrapingBee(api_key: str)

Crawler that uses ScrapingBee to scrape web pages.

client
get_content(url)

Gets content for a specific URL.

Parameters:

url (str) – URL to scrape

Returns:

Raw HTML content and extracted text content (in this order)

Return type:

tuple[str, str]
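Since crawlerScrapingBee takes an API key, one common pattern is to read the key from the environment rather than hard-coding it. The sketch below assumes a hypothetical environment variable name, SCRAPINGBEE_API_KEY, and hypothetical helper names. Only the crawlerScrapingBee(api_key=...) call comes from the signature documented above.

```python
import os


def scrapingbee_api_key(env_var: str = "SCRAPINGBEE_API_KEY") -> str:
    """Read the API key from the environment (the variable name is an assumption)."""
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(f"Set the {env_var} environment variable first.")
    return key


def make_scrapingbee_crawler():
    """Build a crawlerScrapingBee using the key from the environment."""
    # Lazy import: requires emergingtrajectories at call time only.
    from emergingtrajectories.crawlers import crawlerScrapingBee

    return crawlerScrapingBee(api_key=scrapingbee_api_key())
```

Failing early with a clear error when the key is missing tends to be easier to diagnose than an authentication error from the first request.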