latexml_html_cleaner package

Submodules

latexml_html_cleaner.clean_html module

Class definition of HTMLCleaner

class latexml_html_cleaner.clean_html.HTMLCleaner(filename, skip_tags=None, overwrite=False, find_and_replace_patterns=None, clear_default_patterns=False, output_filename=None)[source]

Bases: object

Class to clean the contents of an html-file

Parameters:
  • filename (Path) – path to the html-file to clean

  • skip_tags (dict, optional) – tags to drop instead of the default tags, keyed by tag name (see Notes)

  • overwrite (bool, optional) – overwrite the existing file if it exists

  • find_and_replace_patterns (dict, optional) – replace these patterns

  • clear_default_patterns (bool, optional) – clear the default patterns if they exist

  • output_filename (Path, optional) – filename to write to. If not given, the name is based on the input file

clean_soup

BeautifulSoup object

Type:

BeautifulSoup

Notes

  • By default, all attributes starting with ltx are skipped.

  • With `skip_tags` we can drop entire `<>` environments based on the environment's name (the key of the dict) and a list of attribute key/value pairs.

  • If such a key/value pair occurs, the entire `<>` tag is discarded.

  • With the default list, for example, a tag `<span class="ltx_bibblock ltx_bib_cited">Cited by etc. </span>` is discarded in its entirety, including all nested values.

  • The values of a key/value pair can also be given as a list if more than one tag has the same key name but different values.
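As a rough illustration of the matching rule described in the notes above, the sketch below models a `skip_tags` mapping as plain Python data and checks a single tag against it. The helper `should_skip`, the exact dict layout, and the sample values are assumptions made for illustration only; the package's own matching (see `skip_this_tag`) may differ, for instance in how multi-valued class attributes are compared.

```python
# Hypothetical layout: tag name -> {attribute key -> value or list of values}.
SKIP_TAGS = {
    "span": {"class": ["ltx_bibblock ltx_bib_cited", "ltx_note_mark"]},
    "div": {"id": "footer"},
}


def should_skip(tag_name, attributes, skip_tags):
    """Return True if any attribute key/value pair of the tag is listed in skip_tags."""
    rules = skip_tags.get(tag_name, {})
    for key, wanted in rules.items():
        # Accept either a single string or a list of strings as the values.
        wanted_values = [wanted] if isinstance(wanted, str) else list(wanted)
        if attributes.get(key) in wanted_values:
            return True
    return False


print(should_skip("span", {"class": "ltx_bibblock ltx_bib_cited"}, SKIP_TAGS))
print(should_skip("span", {"class": "ltx_text"}, SKIP_TAGS))
```

A tag that matches would then be dropped together with everything nested inside it, as the notes describe.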

clean_html()[source]

Read the html file and clean the html code

latexml_html_cleaner.clean_html.skip_this_tag(tag, attribute_key, attribute_values, skip_tags, combined=False)[source]

Collect all the tags and attributes we want to remove

Parameters:
  • tag (object) – beautiful soup tag to clean

  • attribute_key (str) – key of the attribute

  • attribute_values (list) – values of the attribute

  • skip_tags (bool) – skip the tag if true

  • combined (bool, optional) – only remove the tag in case we match the combined tag

Returns:

all the tags and attributes to skip

Return type:

list

latexml_html_cleaner.clean_html.smart_open(filename=None)[source]

Context manager for a smart file opener for reading from file or standard input

Parameters:

filename (Path) – Path to the file to open

Returns:

file-like object

Return type:

file
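A context manager of this kind can be sketched with the standard library as follows. This is not the package's implementation: the name `smart_open_sketch`, the UTF-8 encoding, and the convention that `"-"` also means standard input are assumptions made for illustration.

```python
import contextlib
import sys


@contextlib.contextmanager
def smart_open_sketch(filename=None):
    """Yield a readable handle: the named file, or standard input if no name is given."""
    if filename is not None and str(filename) != "-":
        handle = open(filename, "r", encoding="utf-8")
    else:
        handle = sys.stdin
    try:
        yield handle
    finally:
        # Close only what we opened ourselves; leave sys.stdin alone.
        if handle is not sys.stdin:
            handle.close()
```

Used as `with smart_open_sketch(path) as stream: text = stream.read()`, the same calling code works for a file on disk and for piped input.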

latexml_html_cleaner.clean_html.to_lijst(values)[source]

Convert a string or a list of strings to a list of strings
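The conversion described here could look roughly like the sketch below. The function name `to_list_sketch` is hypothetical, and the actual `to_lijst` may handle other input types differently.

```python
def to_list_sketch(values):
    """Return values as a list of strings; a bare string becomes a one-element list."""
    if isinstance(values, str):
        return [values]
    return list(values)


print(to_list_sketch("ltx_note"))
print(to_list_sketch(["ltx_note", "ltx_tag"]))
```

Normalising the input this way lets later code iterate over attribute values without checking whether a single value or several were given.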

latexml_html_cleaner.main module

This conversion script cleans an HTML file generated with LaTeX so that it can be read more easily into the sitescore

Run htmlcleaner --help to get the help message:

usage: htmlcleaner [-h] [--version] [--output_filename STR] [-v] [-vv] [-w] [-f [PATH ...]]
                   [--clear_find_and_replace_defaults]
                   STR [STR ...]

Cleans html files and removes hyperrefs

positional arguments:
  STR                   File name of html input

options:
  -h, --help            show this help message and exit
  --version             show program's version number and exit
  --output_filename STR
                        File name of output html file
  -v, --verbose         set loglevel to INFO
  -vv, --very-verbose, --debug
                        set loglevel to DEBUG
  -w, --overwrite       Overwrite the input html. Default = False, which means a new html is created with the suffix
                        _clean
  -f [PATH ...], --find_and_replace [PATH ...]
                        Define a list of key=value pairs to define string patterns you want to replace
  --clear_find_and_replace_defaults
                        Clear the predefined find and replace patterns
latexml_html_cleaner.main.main(args)[source]

Wrapper allowing HTMLCleaner to be called with string arguments in a CLI fashion

Parameters:

args (List[str]) – command line parameters as list of strings (for example ["--verbose", "input.html"]).

latexml_html_cleaner.main.parse_args(args)[source]

Parse command line parameters

Parameters:

args (List[str]) – command line parameters as list of strings (for example ["--help"]).

Returns:

command line parameters namespace

Return type:

argparse.Namespace
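To make the usage message above more concrete, the following sketch builds an `argparse` parser that accepts the same options. It is reconstructed from the help text only: the function name `build_parser_sketch`, the destination names (`filenames`, `loglevel`, and so on), and the choice of actions are assumptions, and `--version` is left out because the version string is not given here.

```python
import argparse
import logging


def build_parser_sketch():
    """Build a parser that mirrors the documented htmlcleaner usage message."""
    parser = argparse.ArgumentParser(
        prog="htmlcleaner", description="Cleans html files and removes hyperrefs"
    )
    parser.add_argument("filenames", metavar="STR", nargs="+",
                        help="File name of html input")
    parser.add_argument("--output_filename", metavar="STR",
                        help="File name of output html file")
    parser.add_argument("-v", "--verbose", dest="loglevel",
                        action="store_const", const=logging.INFO,
                        help="set loglevel to INFO")
    parser.add_argument("-vv", "--very-verbose", "--debug", dest="loglevel",
                        action="store_const", const=logging.DEBUG,
                        help="set loglevel to DEBUG")
    parser.add_argument("-w", "--overwrite", action="store_true",
                        help="Overwrite the input html")
    parser.add_argument("-f", "--find_and_replace", metavar="PATH", nargs="*",
                        help="key=value pairs defining patterns to replace")
    parser.add_argument("--clear_find_and_replace_defaults", action="store_true",
                        help="Clear the predefined find and replace patterns")
    return parser


args = build_parser_sketch().parse_args(["paper.html", "-w", "-f", "foo=bar"])
print(args.filenames, args.overwrite, args.find_and_replace)
```

Because `-f` takes a variable number of values, placing the input file names before it avoids having them absorbed into the find-and-replace list.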

latexml_html_cleaner.main.parse_var(s)[source]

Parse a key/value pair separated by ‘=’. This is the reverse of ShellArgs.

On the command line (argparse) a declaration will typically look like:

foo=hello

or

foo="hello world"

latexml_html_cleaner.main.parse_vars(items)[source]

Parse a series of key-value pairs and return a dictionary
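A minimal sketch of this kind of key=value parsing is shown below. The names `parse_var_sketch` and `parse_vars_sketch` are hypothetical, and splitting at the first `=` and raising `ValueError` on a missing `=` are assumptions; the package's `parse_var` and `parse_vars` may behave differently on malformed input.

```python
def parse_var_sketch(item):
    """Split 'key=value' at the first '=' and return a (key, value) tuple."""
    key, sep, value = item.partition("=")
    if not sep:
        raise ValueError(f"expected key=value, got {item!r}")
    return key.strip(), value


def parse_vars_sketch(items):
    """Turn a list of 'key=value' strings into a dictionary."""
    return dict(parse_var_sketch(item) for item in (items or []))


print(parse_vars_sketch(["foo=hello", "bar=hello world"]))
```

The shell removes the quotes in `foo="hello world"` before the program sees the argument, so the parser receives `foo=hello world` as a single string.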

latexml_html_cleaner.main.run()[source]

Calls main() passing the CLI arguments extracted from sys.argv

This function can be used as entry point to create console scripts with setuptools.

latexml_html_cleaner.main.setup_logging(loglevel)[source]

Setup basic logging

Parameters:

loglevel (int) – minimum loglevel for emitting messages
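Basic logging setup of this kind is usually a thin wrapper around `logging.basicConfig`. The sketch below is an assumption about what such a helper might look like, not the package's code; the name `setup_logging_sketch`, the message format, and the use of `force=True` (Python 3.8+) are illustrative choices.

```python
import logging
import sys


def setup_logging_sketch(loglevel):
    """Configure the root logger to write messages at or above loglevel to stdout."""
    logformat = "[%(asctime)s] %(levelname)s:%(name)s:%(message)s"
    logging.basicConfig(
        level=loglevel,
        stream=sys.stdout,
        format=logformat,
        datefmt="%Y-%m-%d %H:%M:%S",
        force=True,  # replace any handlers that were configured earlier
    )


setup_logging_sketch(logging.INFO)
logging.getLogger(__name__).info("logging configured")
```

With the command line flags described earlier, `-v` would correspond to passing `logging.INFO` and `-vv` to passing `logging.DEBUG`.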

Module contents