emergingtrajectories.pdf

This module provides a small set of utility functions for loading PDF content. It may be easier to use PyPDF directly and skip this module altogether. We created it as a placeholder for specialized functions and classes that do “fancy” things with PDFs in the future (e.g., OCR, table extraction).

Functions

get_PDF_content_from_file_by_page(→ list)

Loads a PDF file and extracts the text into a list of strings, one for each page.

get_PDF_content_from_url_by_page(→ list)

Loads a PDF file from a URL and extracts the text into a list of strings, one for each page.

get_PDF_content_by_page_from_file(→ str)

Loads a PDF file and extracts the text into one big string.

get_PDF_content_by_page_from_url(→ str)

Loads a PDF file from a URL and extracts the text into one big string.

Module Contents

emergingtrajectories.pdf.get_PDF_content_from_file_by_page(file_path: str) → list

Loads a PDF file and extracts the text into a list of strings, one for each page.

Parameters:

file_path (str) – The path to the PDF file.

Returns:

A list of strings, one for each page.

Return type:

list
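Since the module description suggests that using PyPDF directly may be just as easy, here is a minimal sketch of what per-page extraction from a file could look like with the third-party pypdf package. It is not this module's actual implementation; the names pdf_pages_from_file and normalize_page_texts are hypothetical, and treating a page with no extractable text as an empty string is an assumption.

```python
from typing import Iterable, List, Optional


def normalize_page_texts(raw_texts: Iterable[Optional[str]]) -> List[str]:
    """Return one string per page, using "" where no text was extracted.

    Hypothetical helper: mapping missing text to "" is an assumption,
    not documented behavior of emergingtrajectories.pdf.
    """
    return [text or "" for text in raw_texts]


def pdf_pages_from_file(file_path: str) -> List[str]:
    """Extract text from a PDF file, one string per page (hypothetical name)."""
    # pypdf is a third-party dependency; importing it here keeps the
    # helper above usable even when pypdf is not installed.
    from pypdf import PdfReader

    reader = PdfReader(file_path)
    return normalize_page_texts(page.extract_text() for page in reader.pages)
```

Calling pdf_pages_from_file("report.pdf") would return a list whose length equals the number of pages, so pages[0] is the text of the first page.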

emergingtrajectories.pdf.get_PDF_content_from_url_by_page(url: str) → list

Loads a PDF file from a URL and extracts the text into a list of strings, one for each page.

Parameters:

url (str) – The URL to the PDF file.

Returns:

A list of strings, one for each page.

Return type:

list
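As a rough illustration of loading a PDF from a URL, the sketch below downloads the bytes with the standard library and hands them to the third-party pypdf package through an in-memory buffer. This is an assumption about one reasonable approach, not the module's actual code; fetch_bytes, pdf_pages_from_bytes, pdf_pages_from_url, and the timeout parameter are all hypothetical.

```python
import io
import urllib.request
from typing import List


def fetch_bytes(url: str, timeout: float = 30.0) -> bytes:
    """Download the raw bytes at a URL (hypothetical helper)."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read()


def pdf_pages_from_bytes(data: bytes) -> List[str]:
    """Extract text from in-memory PDF bytes, one string per page."""
    # pypdf is a third-party dependency, imported lazily so that
    # fetch_bytes can be used without it.
    from pypdf import PdfReader

    reader = PdfReader(io.BytesIO(data))
    # Assumption: a page with no extractable text becomes "".
    return [page.extract_text() or "" for page in reader.pages]


def pdf_pages_from_url(url: str, timeout: float = 30.0) -> List[str]:
    """Download a PDF and return its text page by page (hypothetical name)."""
    return pdf_pages_from_bytes(fetch_bytes(url, timeout=timeout))
```

Keeping the download separate from the parsing makes it possible to reuse pdf_pages_from_bytes for PDFs obtained some other way, such as from a cache.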

emergingtrajectories.pdf.get_PDF_content_by_page_from_file(file_path: str) → str

Loads a PDF file and extracts the text into one big string.

Parameters:

file_path (str) – The path to the PDF file.

Returns:

The text content of the PDF file.

Return type:

str
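The single-string form can be thought of as the per-page list joined together. The sketch below shows that relationship; the function name join_page_texts is hypothetical, and the newline separator is an assumption, since the separator this module uses between pages is not documented here.

```python
from typing import Iterable


def join_page_texts(pages: Iterable[str], separator: str = "\n") -> str:
    """Combine per-page text into one string.

    Hypothetical helper: the default separator is an assumption and may
    differ from what emergingtrajectories.pdf actually does.
    """
    return separator.join(pages)


combined = join_page_texts(["Page one text.", "Page two text."])
```

Here combined holds both pages in one string with a single newline between them; an empty list of pages yields an empty string.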

emergingtrajectories.pdf.get_PDF_content_by_page_from_url(url: str) → str

Loads a PDF file from a URL and extracts the text into one big string.

Parameters:

url (str) – The URL to the PDF file.

Returns:

The text content of the PDF file.

Return type:

str