latexml_html_cleaner package

Submodules

latexml_html_cleaner.clean_html module

Class definition of htmlcleaner

class latexml_html_cleaner.clean_html.HTMLCleaner(filename, skip_tags=None, overwrite=False, find_and_replace_patterns=None, clear_default_patterns=False, output_filename=None)[source]

Bases: object

Class to clean the contents of an html-file

Parameters:

filename (Path) – path to the html-file to clean
skip_tags (dict, optional) – tags to drop entirely, keyed by tag name (see Notes)
overwrite (bool, optional) – overwrite the existing file if it exists
find_and_replace_patterns (dict, optional) – replace these patterns
clear_default_patterns (bool, optional) – clear the default patterns if they exist
output_filename (Path, optional) – filename to write to. If not given, the name is based on the input file

clean_soup

BeautifulSoup object

Type: BeautifulSoup

Notes

By default, all attributes starting with ltx are skipped.
With `skip_tags` we can drop entire `<>` environments based on the environment's name (the key of the dict) and a list of attribute key/value pairs.
If such a key/value pair occurs, the entire `<>` tag is discarded.
A default list is defined first. For example, a tag `<span class="ltx_bibblock ltx_bib_cited">Cited by etc. </span>` is discarded in its entirety, including all nested values.
The values of a key/value pair can also be given as a list when more than one tag has the same key name but different values.
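A minimal sketch of how such a `skip_tags` mapping might be laid out and matched. The dictionary layout, the helper name `should_skip` and the second class value are assumptions for illustration, not the package's actual implementation. Tags are modelled as a name plus a plain attribute dict so that the example needs no third-party library.

```python
# Hypothetical layout: tag name -> {attribute key: value or list of values}.
skip_tags = {
    "span": {"class": ["ltx_bibblock ltx_bib_cited", "ltx_note_mark"]},
}


def should_skip(tag_name, attrs, skip_tags):
    """Return True if the tag carries one of the listed attribute values."""
    rules = skip_tags.get(tag_name, {})
    for key, values in rules.items():
        # Accept a single string as well as a list of strings.
        if isinstance(values, str):
            values = [values]
        if attrs.get(key) in values:
            return True
    return False


cited = {"class": "ltx_bibblock ltx_bib_cited"}
plain = {"class": "ltx_text"}
print(should_skip("span", cited, skip_tags))  # True
print(should_skip("span", plain, skip_tags))  # False
```

Listing several values under one attribute key covers the case where tags share the key name but differ in value.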

clean_html()[source]: Read the html file and clean the html code

latexml_html_cleaner.clean_html.skip_this_tag(tag, attribute_key, attribute_values, skip_tags, combined=False)[source]

Collect all the tags and attributes we want to remove

Parameters:

tag (object) – beautiful soup tag to clean
attribute_key (str) – key of the attribute
attribute_values (list) – values of the attribute
skip_tags (bool) – skip the tag if true
combined (bool, optional) – only remove the tag in case we match the combined tag

Returns:

all the tags and attributes to skip

Return type:

list

latexml_html_cleaner.clean_html.smart_open(filename=None)[source]

Context manager for a smart file opener for reading from file or standard input

Parameters: filename (Path) – Path to the file to open
Returns: file-like object
Return type: file
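A context manager of this kind could look roughly like the sketch below. It is not the package's source: treating `None` or `"-"` as standard input and opening files as UTF-8 are assumptions made for the example.

```python
import contextlib
import sys


@contextlib.contextmanager
def smart_open(filename=None):
    """Yield an open file for reading, or standard input if no name is given."""
    if filename is not None and str(filename) != "-":
        handle = open(filename, "r", encoding="utf-8")
    else:
        handle = sys.stdin
    try:
        yield handle
    finally:
        # Never close standard input; only close files we opened ourselves.
        if handle is not sys.stdin:
            handle.close()
```

The caller can then write `with smart_open(path) as fh: html = fh.read()` without caring where the text comes from.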

latexml_html_cleaner.clean_html.to_lijst(values)[source]: Convert a string or a list of strings to a list of strings
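The behaviour described for `to_lijst` ("lijst" is Dutch for "list") can be illustrated with a short sketch. The body below is an assumption based only on the one-line description, not the package's code.

```python
def to_lijst(values):
    """Return a list of strings for a single string or a list of strings."""
    if isinstance(values, str):
        # A bare string becomes a one-element list instead of being
        # split into its characters.
        return [values]
    return list(values)


print(to_lijst("ltx_tag"))         # ['ltx_tag']
print(to_lijst(["ltx_a", "ltx_b"]))  # ['ltx_a', 'ltx_b']
```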

latexml_html_cleaner.main module

This conversion script cleans an HTML file generated with LaTeX so that it can be read more easily into the sitescore

Run htmlcleaner --help to get the help message:

usage: htmlcleaner [-h] [--version] [--output_filename STR] [-v] [-vv] [-w] [-f [PATH ...]]
                   [--clear_find_and_replace_defaults]
                   STR [STR ...]

Cleans html files and removes hyperrefs

positional arguments:
  STR                   File name of html input

options:
  -h, --help            show this help message and exit
  --version             show program's version number and exit
  --output_filename STR
                        File name of output html file
  -v, --verbose         set loglevel to INFO
  -vv, --very-verbose, --debug
                        set loglevel to DEBUG
  -w, --overwrite       Overwrite the input html. Default = False, which means a new html is created with the suffix
                        _clean
  -f [PATH ...], --find_and_replace [PATH ...]
                        Define a list of key=value pairs to define string patterns you want to replace
  --clear_find_and_replace_defaults
                        Clear the predefined find and replace patterns

latexml_html_cleaner.main.main(args)[source]

Wrapper allowing the HTML cleaner to be called with string arguments in a CLI fashion

Parameters: args (List[str]) – command line parameters as list of strings (for example ["--verbose", "input.html"]).

latexml_html_cleaner.main.parse_args(args)[source]

Parse command line parameters

Parameters: args (List[str]) – command line parameters as list of strings (for example ["--help"]).
Returns: command line parameters namespace
Return type: argparse.Namespace
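As an illustration, a parser that accepts the options listed in the help message might be built as sketched below. The option strings and help texts are taken from that message; the destination names (`filenames`, `loglevel`) and the use of `store_const` for the verbosity flags are assumptions, and `--version` is left out because the version string is not given here.

```python
import argparse
import logging


def parse_args(args):
    """Parse a list of command line strings into an argparse.Namespace."""
    parser = argparse.ArgumentParser(
        prog="htmlcleaner",
        description="Cleans html files and removes hyperrefs",
    )
    parser.add_argument("filenames", metavar="STR", nargs="+",
                        help="File name of html input")
    parser.add_argument("--output_filename", metavar="STR",
                        help="File name of output html file")
    parser.add_argument("-v", "--verbose", dest="loglevel",
                        action="store_const", const=logging.INFO,
                        help="set loglevel to INFO")
    parser.add_argument("-vv", "--very-verbose", "--debug", dest="loglevel",
                        action="store_const", const=logging.DEBUG,
                        help="set loglevel to DEBUG")
    parser.add_argument("-w", "--overwrite", action="store_true",
                        help="Overwrite the input html")
    parser.add_argument("-f", "--find_and_replace", metavar="PATH", nargs="*",
                        help="key=value pairs of string patterns to replace")
    parser.add_argument("--clear_find_and_replace_defaults",
                        action="store_true",
                        help="Clear the predefined find and replace patterns")
    return parser.parse_args(args)


namespace = parse_args(["-v", "paper.html", "-f", "foo=bar"])
print(namespace.filenames, namespace.find_and_replace)
```

Because `-f` takes any number of values, the input file names are best placed before it on the command line in this sketch.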

latexml_html_cleaner.main.parse_var(s)[source]

Parse a key, value pair, separated by '='. That's the reverse of ShellArgs.

On the command line (argparse) a declaration will typically look like: foo=hello
or: foo="hello world"

latexml_html_cleaner.main.parse_vars(items)[source]: Parse a series of key-value pairs and return a dictionary
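The two helpers can be sketched as follows. This is an illustration of the described behaviour rather than the package's code: splitting on the first `=` only, stripping whitespace around the key, and raising `ValueError` for an item without `=` are assumptions.

```python
def parse_var(s):
    """Split 'key=value' on the first '=' and return (key, value)."""
    key, sep, value = s.partition("=")
    if not sep:
        raise ValueError(f"expected key=value, got {s!r}")
    return key.strip(), value


def parse_vars(items):
    """Turn a sequence of 'key=value' strings into a dictionary."""
    # Treat a missing option (None) the same as an empty list.
    return dict(parse_var(item) for item in (items or []))


print(parse_vars(["foo=hello", "bar=hello world"]))
```

The shell removes the quotes in `foo="hello world"` before argparse sees the argument, so the helper receives `foo=hello world` as a single string.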

latexml_html_cleaner.main.run()[source]

Calls main() passing the CLI arguments extracted from sys.argv

This function can be used as entry point to create console scripts with setuptools.

latexml_html_cleaner.main.setup_logging(loglevel)[source]

Setup basic logging

Parameters: loglevel (int) – minimum loglevel for emitting messages
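A basic logging setup along these lines could be written as in the sketch below. The message format, the choice of standard output as the stream and the `force=True` flag (Python 3.8+) are assumptions, not details confirmed by this documentation.

```python
import logging
import sys


def setup_logging(loglevel):
    """Configure the root logger to emit messages at or above loglevel."""
    logformat = "[%(asctime)s] %(levelname)s:%(name)s:%(message)s"
    logging.basicConfig(
        level=loglevel,
        stream=sys.stdout,
        format=logformat,
        datefmt="%Y-%m-%d %H:%M:%S",
        force=True,  # replace handlers installed by an earlier call
    )


setup_logging(logging.INFO)
logging.getLogger(__name__).info("logging configured")
```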
