emergingtrajectories.news
Classes
RSSAgent | A simple wrapper for an RSS feed, so we can query it for URLs.
NewsBingAgent | Creates a new Bing News API agent. To learn more, see: https://github.com/microsoft/bing-search-sdk-for-python/
NewsAPIAgent | A simple wrapper for the News API, so we can query it for URLs.
FinancialTimesAgent | This is a POC agent that uses Playwright to crawl the Financial Times articles you are interested in. Note that you NEED to be an FT subscriber to make this work, and thus need to provide your FT email and password.
Functions
force_empty_content(rss_url, content, cache_function) | Force the crawler to visit every URL in the RSS feed and save it as a blank content file. We do this because some RSS feeds have a lot of old URLs we do not need to crawl, and we only want to crawl the delta over some period.
Module Contents
- emergingtrajectories.news.force_empty_content(rss_url: str, content, cache_function) → None
Force the crawler to visit every URL in the RSS feed and save it as a blank content file. We do this because some RSS feeds have a lot of old URLs we do not need to crawl, and we only want to crawl the delta over some period.
- Parameters:
rss_url (str) – The URL of the RSS feed.
content – The content string to save for each URL.
cache_function – The function to call with each URL and the content to save.
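The reasoning behind pre-filling the cache, so that only new URLs are crawled later, can be sketched as follows. This is a minimal illustration and not the library's implementation: `save_to_cache`, `mark_urls_as_seen`, `urls_to_crawl`, and the dictionary-backed cache are hypothetical names, and the feed is represented as a plain list of URLs rather than a live RSS feed.

```python
from typing import Callable, Dict, List

# Hypothetical in-memory cache, standing in for whatever store a real
# cache function would write to.
cache: Dict[str, str] = {}


def save_to_cache(url: str, content: str) -> None:
    """Record content for a URL (a stand-in for a real cache function)."""
    cache[url] = content


def mark_urls_as_seen(urls: List[str], content: str,
                      cache_function: Callable[[str, str], None]) -> None:
    """Store placeholder content for every URL so it counts as crawled."""
    for url in urls:
        cache_function(url, content)


def urls_to_crawl(urls: List[str]) -> List[str]:
    """Return only the URLs that are not in the cache yet (the delta)."""
    return [url for url in urls if url not in cache]


# Mark everything currently in the feed as seen, with blank content.
old_feed = ["https://example.com/a", "https://example.com/b"]
mark_urls_as_seen(old_feed, "", save_to_cache)

# Later the feed gains one entry; only that entry still needs crawling.
new_feed = old_feed + ["https://example.com/c"]
delta = urls_to_crawl(new_feed)
```

In this sketch the blank entries act purely as markers: a later crawl pass can skip any URL that already has a cache entry, whatever its content.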
- class emergingtrajectories.news.RSSAgent(rss_url, crawler=None)
A simple wrapper for an RSS feed, so we can query it for URLs.
- Parameters:
rss_url (str) – The URL of the RSS feed.
crawler (Crawler, optional) – The crawler to use. Defaults to None, in which case we will use crawlerPlaywright in headless mode.
- rss_url
- get_news_as_list() → list
Query the RSS feed for news articles, and return their URLs as a list.
- Returns:
A list of URLs.
- Return type:
list
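To make the "feed in, list of URLs out" behavior concrete, here is a small sketch that extracts item links from an RSS 2.0 document using only the standard library. It is an assumption-laden illustration, not how RSSAgent works internally: `SAMPLE_RSS` and `extract_item_links` are hypothetical, and the feed is an inline string so the example runs without network access.

```python
import xml.etree.ElementTree as ET
from typing import List

# Hypothetical inline feed, used instead of fetching a real rss_url.
SAMPLE_RSS = """<rss version="2.0">
  <channel>
    <title>Example feed</title>
    <link>https://example.com/</link>
    <item><title>First</title><link>https://example.com/first</link></item>
    <item><title>Second</title><link>https://example.com/second</link></item>
  </channel>
</rss>
"""


def extract_item_links(rss_xml: str) -> List[str]:
    """Return the <link> of every <item> in an RSS 2.0 document."""
    root = ET.fromstring(rss_xml)
    links = []
    for item in root.iter("item"):
        link = item.findtext("link")
        if link:
            links.append(link.strip())
    return links


links = extract_item_links(SAMPLE_RSS)
```

Only links inside `<item>` elements are collected, so the channel-level link to the site home page is left out.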
- class emergingtrajectories.news.NewsBingAgent(api_key: str, endpoint: str)
Creates a new Bing News API agent. To learn more, see: https://github.com/microsoft/bing-search-sdk-for-python/
- Parameters:
api_key (str) – The Bing News API key.
endpoint (str) – The Bing News API endpoint.
- api_key
- endpoint
- get_news_as_list(query: str, market: str = 'en-us') → list
Gets a list of URLs from the Bing News API. For more information on markets, see: https://learn.microsoft.com/en-us/bing/search-apis/bing-web-search/reference/market-codes
- Parameters:
query (str) – The query to search for.
market (str, optional) – The market to search in. Defaults to “en-us” (US English).
- Returns:
A list of URLs.
- Return type:
list
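As a rough illustration of how the query and market parameters map onto a Bing News Search request, the sketch below builds (but does not send) a request and extracts URLs from a response-shaped dictionary. Several things here are assumptions rather than the library's code: the helper names `build_bing_news_request` and `extract_urls`, the choice to treat `endpoint` as the full news search URL, and the hand-written `sample_response`. The `q` and `mkt` query parameters and the `Ocp-Apim-Subscription-Key` header follow the Bing News Search v7 REST API.

```python
from typing import Any, Dict, List
from urllib.parse import urlencode
from urllib.request import Request


def build_bing_news_request(api_key: str, endpoint: str, query: str,
                            market: str = "en-us") -> Request:
    """Build a GET request for a news search without sending it."""
    params = urlencode({"q": query, "mkt": market})
    return Request(f"{endpoint}?{params}",
                   headers={"Ocp-Apim-Subscription-Key": api_key})


def extract_urls(response_json: Dict[str, Any]) -> List[str]:
    """Collect the article URLs from a news search response body."""
    return [item["url"] for item in response_json.get("value", [])
            if "url" in item]


# Placeholder key and an assumed endpoint value.
request = build_bing_news_request(
    "YOUR_KEY",
    "https://api.bing.microsoft.com/v7.0/news/search",
    "interest rates",
)

# Hand-written stand-in for a real response, trimmed to the fields used.
sample_response = {
    "value": [{"name": "Example headline", "url": "https://example.com/story"}]
}
urls = extract_urls(sample_response)
```

Keeping request construction separate from response parsing makes both halves easy to check without an API key or network access.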
- class emergingtrajectories.news.NewsAPIAgent(api_key, top_headlines=False, crawler=None)
A simple wrapper for the News API, so we can query it for URLs.
- Parameters:
api_key (str) – The News API key.
top_headlines (bool, optional) – Whether to get top headlines. Defaults to False.
crawler (Crawler, optional) – The crawler to use. Defaults to None, in which case we will use crawlerPlaywright in headless mode.
- api_key
- top_headlines = False
- get_news_as_list(query: str) → list
Query the News API for news articles, and return them as a list of dictionaries.
- Parameters:
query (str) – The query to search for.
- Returns:
A list of dictionaries, where each dictionary represents a news article.
- Return type:
list
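The effect of the top_headlines flag can be pictured as a choice between the two News API v2 endpoints. The sketch below is an assumption about the general shape of such a call, not the agent's actual code (which may use a client package instead): `build_newsapi_url`, `article_urls`, and `sample_response` are hypothetical, and nothing is sent over the network.

```python
from typing import Any, Dict, List
from urllib.parse import urlencode

NEWSAPI_BASE = "https://newsapi.org/v2"


def build_newsapi_url(api_key: str, query: str,
                      top_headlines: bool = False) -> str:
    """Pick the endpoint from the flag and attach the query and API key."""
    path = "top-headlines" if top_headlines else "everything"
    return f"{NEWSAPI_BASE}/{path}?" + urlencode({"q": query, "apiKey": api_key})


def article_urls(response_json: Dict[str, Any]) -> List[str]:
    """Pull the URL out of each article dictionary in a response body."""
    return [article["url"] for article in response_json.get("articles", [])
            if article.get("url")]


everything_url = build_newsapi_url("YOUR_KEY", "inflation")
headlines_url = build_newsapi_url("YOUR_KEY", "inflation", top_headlines=True)

# Hand-written stand-in for a real response, trimmed to the fields used.
sample_response = {
    "status": "ok",
    "articles": [
        {"title": "Example", "url": "https://example.com/inflation-story"}
    ],
}
urls = article_urls(sample_response)
```

Because each article arrives as a dictionary, a caller that only needs URLs can reduce the result with a helper like `article_urls`.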
- class emergingtrajectories.news.FinancialTimesAgent(user_email, user_password)
This is a POC agent that uses Playwright to crawl the Financial Times articles you are interested in. Note that you NEED to be an FT subscriber to make this work, and thus need to provide your FT email and password.
- Parameters:
user_email (str) – Your FT email.
user_password (str) – Your FT password.
- ft_rss_feed_urls = ['https://www.ft.com/rss/home', 'https://www.ft.com/world?format=rss',...
- ft_login_url = 'https://ft.com/login'
- ft_main_url = 'https://ft.com/'
- user_email
- user_password
- get_news(urls: list[str] = None) → list
Get the news from the Financial Times as a list of tuples, where each tuple contains the URL and the extracted text content.
- Parameters:
urls (list[str], optional) – A list of URLs to get content for. Defaults to None.
- Returns:
A list of lists – URLs, HTML, and text content.