emergingtrajectories.news

Classes

RSSAgent

A simple wrapper for an RSS feed, so we can query it for URLs.

NewsBingAgent

Creates a new Bing News API agent. To learn more, see: https://github.com/microsoft/bing-search-sdk-for-python/

NewsAPIAgent

A simple wrapper for the News API, so we can query it for URLs.

FinancialTimesAgent

This is a proof-of-concept (POC) agent that uses Playwright to crawl the Financial Times articles you are interested in. You must be an FT subscriber for this to work, so you need to provide your FT email and password.

Functions

force_empty_content(→ None)

Force the crawler to visit every URL in the RSS feed and save it as a blank content file. We do this because some RSS feeds contain many old URLs that we do not need to crawl; we only want to crawl the delta over some period.

Module Contents

emergingtrajectories.news.force_empty_content(rss_url: str, content, cache_function) → None

Force the crawler to visit every URL in the RSS feed and save it as a blank content file. We do this because some RSS feeds contain many old URLs that we do not need to crawl; we only want to crawl the delta over some period.

Parameters:
  • rss_url (str) – The URL of the RSS feed.

  • content (str) – The content string to save for each URL (blank, when the goal is to mark URLs as already crawled).

  • cache_function – The function called with each URL and the content to save.
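The "crawl only the delta" idea can be sketched without the library. This is a minimal illustration, not the module's implementation: the helper names mark_urls_as_seen and urls_to_crawl are hypothetical, the cache is a plain dict, and the cache function is assumed to accept (url, content).

```python
from typing import Callable, Iterable


def mark_urls_as_seen(
    urls: Iterable[str],
    content: str,
    cache_function: Callable[[str, str], None],
) -> None:
    """Store `content` (typically blank) for every URL via `cache_function`."""
    for url in urls:
        cache_function(url, content)


def urls_to_crawl(feed_urls: Iterable[str], cache: dict) -> list:
    """Return only the URLs that have no cache entry yet (the delta)."""
    return [url for url in feed_urls if url not in cache]


# Hypothetical in-memory cache; assumption: the cache function takes (url, content).
cache = {}
old_feed = ["https://example.com/a", "https://example.com/b"]
mark_urls_as_seen(old_feed, "", cache.__setitem__)

# Later, the feed gains one new entry; only that entry still needs crawling.
new_feed = old_feed + ["https://example.com/c"]
delta = urls_to_crawl(new_feed, cache)
```

After the first pass, every old URL has a blank cache entry, so a later crawl that skips cached URLs fetches only what was published since.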

class emergingtrajectories.news.RSSAgent(rss_url, crawler=None)

A simple wrapper for an RSS feed, so we can query it for URLs.

Parameters:
  • rss_url (str) – The URL of the RSS feed.

  • crawler (Crawler, optional) – The crawler to use. Defaults to None, in which case we will use crawlerPlaywright in headless mode.

rss_url
get_news_as_list() → list

Query the RSS feed for news articles, and return them as a list of dictionaries.

Returns:

A list of URLs.

Return type:

list
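To illustrate what querying a feed for URLs involves, here is a rough standard-library sketch that pulls item links out of RSS 2.0 markup. It does not use RSSAgent or its crawler; the sample feed and the item_links helper are made up for the example.

```python
import xml.etree.ElementTree as ET

# Made-up RSS 2.0 document standing in for a fetched feed.
SAMPLE_RSS = """<rss version="2.0">
  <channel>
    <title>Example feed</title>
    <link>https://example.com/</link>
    <item><title>First</title><link>https://example.com/first</link></item>
    <item><title>Second</title><link> https://example.com/second </link></item>
    <item><title>No link</title></item>
  </channel>
</rss>"""


def item_links(rss_xml: str) -> list:
    """Return the <link> of every <item>, skipping items without one."""
    root = ET.fromstring(rss_xml)
    links = []
    for item in root.iter("item"):
        link = item.findtext("link")
        if link and link.strip():
            links.append(link.strip())
    return links


rss_links = item_links(SAMPLE_RSS)
```

The channel-level link is ignored because only links nested inside item elements are collected.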

class emergingtrajectories.news.NewsBingAgent(api_key: str, endpoint: str)

Creates a new Bing News API agent. To learn more, see: https://github.com/microsoft/bing-search-sdk-for-python/

Parameters:
  • api_key (str) – The Bing News API key.

  • endpoint (str) – The Bing News API endpoint.

api_key
endpoint
get_news_as_list(query: str, market: str = 'en-us') → list

Gets a list of URLs from the Bing News API. For more information on markets, see: https://learn.microsoft.com/en-us/bing/search-apis/bing-web-search/reference/market-codes

Parameters:
  • query (str) – The query to search for.

  • market (str, optional) – The market to search in. Defaults to “en-us” (US English).

Returns:

A list of URLs.

Return type:

list
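As a hedged sketch of how a query and market code might map onto a request, the snippet below builds a Bing News Search v7 style URL and extracts URLs from a response-shaped dictionary. It uses the REST conventions (q and mkt query parameters, Ocp-Apim-Subscription-Key header, results under "value") rather than the SDK this agent relies on. The helper names, the search URL, and the sample payload are assumptions, and nothing is sent over the network.

```python
from urllib.parse import urlencode


def build_news_search_url(search_url: str, query: str, market: str = "en-us") -> str:
    """Append the query and market code as URL-encoded parameters."""
    return search_url + "?" + urlencode({"q": query, "mkt": market})


def auth_headers(api_key: str) -> dict:
    """Header used by the Bing REST APIs to carry the subscription key."""
    return {"Ocp-Apim-Subscription-Key": api_key}


def urls_from_response(payload: dict) -> list:
    """Collect article URLs from a News Search style response body."""
    return [article["url"] for article in payload.get("value", []) if "url" in article]


# Assumed full search URL; a real deployment may use a different endpoint.
bing_url = build_news_search_url(
    "https://api.bing.microsoft.com/v7.0/news/search",
    "semiconductor exports",
    market="en-gb",
)

# Made-up payload with the same general shape as a News Search response.
sample_payload = {
    "value": [
        {"name": "Story one", "url": "https://example.com/one"},
        {"name": "Story two", "url": "https://example.com/two"},
    ]
}
bing_urls = urls_from_response(sample_payload)
```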

class emergingtrajectories.news.NewsAPIAgent(api_key, top_headlines=False, crawler=None)

A simple wrapper for the News API, so we can query it for URLs.

Parameters:
  • api_key (str) – The News API key.

  • top_headlines (bool, optional) – Whether to get top headlines. Defaults to False.

  • crawler (Crawler, optional) – The crawler to use. Defaults to None, in which case we will use crawlerPlaywright in headless mode.

api_key
top_headlines = False
get_news_as_list(query: str) → list

Query the News API for news articles, and return them as a list of dictionaries.

Parameters:

query (str) – The query to search for.

Returns:

A list of dictionaries, where each dictionary represents a news article.

Return type:

list
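The following sketch shows one plausible reading of the top_headlines flag and of the article dictionaries returned. It assumes the flag selects between the News API "everything" and "top-headlines" endpoints, which the documentation above does not confirm. The helper names and the sample response are hypothetical, and no request is sent.

```python
from urllib.parse import urlencode

NEWS_API_BASE = "https://newsapi.org/v2"


def build_request_url(query: str, top_headlines: bool = False) -> str:
    """Pick the endpoint from the flag (an assumption) and encode the query."""
    path = "top-headlines" if top_headlines else "everything"
    return f"{NEWS_API_BASE}/{path}?" + urlencode({"q": query})


def auth_headers(api_key: str) -> dict:
    """The News API accepts the key in the X-Api-Key header."""
    return {"X-Api-Key": api_key}


everything_url = build_request_url("chip shortage")
headlines_url = build_request_url("chip shortage", top_headlines=True)

# Made-up response body with the general shape of a News API reply.
sample_response = {
    "status": "ok",
    "totalResults": 2,
    "articles": [
        {"title": "Story one", "url": "https://example.com/one"},
        {"title": "Story two", "url": "https://example.com/two"},
    ],
}

# Each article is a dictionary; URLs can be pulled out when only links are needed.
articles = sample_response["articles"]
article_urls = [article["url"] for article in articles]
```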

class emergingtrajectories.news.FinancialTimesAgent(user_email, user_password)

This is a proof-of-concept (POC) agent that uses Playwright to crawl the Financial Times articles you are interested in. You must be an FT subscriber for this to work, so you need to provide your FT email and password.

Parameters:
  • user_email (str) – Your FT email.

  • user_password (str) – Your FT password.

ft_rss_feed_urls = ['https://www.ft.com/rss/home', 'https://www.ft.com/world?format=rss',...
ft_login_url = 'https://ft.com/login'
ft_main_url = 'https://ft.com/'
user_email
user_password
get_news(urls: list[str] = None) → list

Get the news from the Financial Times as a list of tuples, where each tuple contains the URL and the extracted text content.

Parameters:

urls – a list of URLs to get content for.

Returns:

A list of lists – urls, html, and text content
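To make the result shape in the Returns description concrete, the sketch below turns already-fetched pages into [url, html, text] entries using only the standard library. It does not log in to the FT or use Playwright. The text extraction is a simple stand-in for whatever the agent does, and the helper names and sample page are hypothetical.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, ignoring the contents of <script> and <style>."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())


def html_to_text(html: str) -> str:
    """Return the page's visible text as one space-separated string."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


def build_results(pages: list) -> list:
    """Map (url, html) pairs to [url, html, text] entries."""
    return [[url, html, html_to_text(html)] for url, html in pages]


# Made-up page standing in for HTML retrieved by a logged-in browser session.
sample_pages = [
    (
        "https://www.ft.com/content/example",
        "<html><head><style>p{}</style></head><body>"
        "<h1>Headline</h1><p>Body text.</p><script>var x=1;</script>"
        "</body></html>",
    )
]
results = build_results(sample_pages)
```

Keeping the raw HTML next to the extracted text lets a caller re-run extraction later without crawling the page again.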