latexml_html_cleaner package
Submodules
latexml_html_cleaner.clean_html module
Class definition of htmlcleaner
- class latexml_html_cleaner.clean_html.HTMLCleaner(filename, skip_tags=None, overwrite=False, find_and_replace_patterns=None, clear_default_patterns=False, output_filename=None)
Bases:
object
Class to clean the contents of an HTML file
- Parameters:
filename (Path) – path to the html-file to clean
skip_tags (bool, optional) – do not use the default tags
overwrite (bool, optional) – overwrite the existing file if it exists
find_and_replace_patterns (dict, optional) – patterns to find and replace
clear_default_patterns (bool, optional) – clear the default patterns if they exist
output_filename (Path, optional) – filename to write to. If not given, the name is based on the input file
- clean_soup
BeautifulSoup object
- Type:
BeautifulSoup
Notes
By default, all attributes starting with ltx are skipped.
With `skip_tags` we can drop entire `<>` environments based on the environment's name (the key of the dict) and a list of attribute key/value pairs. If such a key/value pair occurs, the entire `<>` tag is discarded.
A default list is defined first. For example, a tag `<span class="ltx_bibblock ltx_bib_cited">Cited by etc. </span>` is discarded in its entirety, including all nested values.
We can also specify the values of a key/value pair in a list if more than one tag has the same key name but different values.
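The matching rule described in these notes can be sketched in plain Python. This is only an illustration under stated assumptions, not the package's implementation: the names `SKIP_TAGS` and `should_skip` are hypothetical, and the layout of the dict (tag name mapped to attribute names, each with a single value or a list of values) is inferred from the description above.

```python
# Hypothetical skip rules: tag name -> {attribute name: value or list of values}.
# The layout is an assumption based on the notes, not the package's real default.
SKIP_TAGS = {
    "span": {"class": ["ltx_bibblock ltx_bib_cited"]},
}


def should_skip(tag_name, attrs, skip_tags=SKIP_TAGS):
    """Return True if a tag with these attributes matches one of the skip rules."""
    rules = skip_tags.get(tag_name)
    if not rules:
        return False
    for key, wanted in rules.items():
        if key not in attrs:
            continue
        # Accept either a single value or a list of alternative values.
        wanted_values = wanted if isinstance(wanted, list) else [wanted]
        actual = attrs[key]
        # BeautifulSoup reports multi-valued attributes such as class as a list.
        if isinstance(actual, list):
            actual = " ".join(actual)
        if actual in wanted_values:
            return True
    return False
```

A tag for which `should_skip` returns True would be dropped together with everything nested inside it. Listing several values under one attribute name covers the case where several tags share the key but differ in value.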
- latexml_html_cleaner.clean_html.skip_this_tag(tag, attribute_key, attribute_values, skip_tags, combined=False)
Collect all the tags and attributes we want to remove
- Parameters:
- Returns:
all the tags and attributes to skip
- Return type:
latexml_html_cleaner.main module
This conversion script cleans an HTML file generated with LaTeX so that it can be read more easily into the sitescore
Run htmlcleaner --help to get the help message:
usage: htmlcleaner [-h] [--version] [--output_filename STR] [-v] [-vv] [-w] [-f [PATH ...]]
[--clear_find_and_replace_defaults]
STR [STR ...]
Cleans html files and removes hyperrefs
positional arguments:
STR File name of html input
options:
-h, --help show this help message and exit
--version show program's version number and exit
--output_filename STR
File name of output html file
-v, --verbose set loglevel to INFO
-vv, --very-verbose, --debug
set loglevel to DEBUG
-w, --overwrite Overwrite the input html. Default = False, which means a new html is created with the suffix
_clean
-f [PATH ...], --find_and_replace [PATH ...]
Define a list of key=value pairs to define string patterns you want to replace
--clear_find_and_replace_defaults
Clear the predefined find and replace patterns
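The help message above can be approximated with a small argparse parser. The sketch below is reconstructed from the help text only and is not the package's own code: `build_parser` and `default_output_name` are hypothetical names, the verbosity and version flags are left out for brevity, and placing the `_clean` suffix before the file extension is an assumption.

```python
import argparse
from pathlib import Path


def build_parser():
    """Build a parser that mirrors the main options of the help message."""
    parser = argparse.ArgumentParser(
        prog="htmlcleaner",
        description="Cleans html files and removes hyperrefs",
    )
    parser.add_argument("filenames", metavar="STR", nargs="+",
                        help="File name of html input")
    parser.add_argument("--output_filename", metavar="STR",
                        help="File name of output html file")
    parser.add_argument("-w", "--overwrite", action="store_true",
                        help="Overwrite the input html")
    parser.add_argument("-f", "--find_and_replace", metavar="PATH", nargs="*",
                        help="key=value pairs defining patterns to replace")
    parser.add_argument("--clear_find_and_replace_defaults", action="store_true",
                        help="Clear the predefined find and replace patterns")
    return parser


def default_output_name(filename):
    """Derive 'name_clean.ext' from 'name.ext' (assumed placement of the suffix)."""
    path = Path(filename)
    return path.with_name(f"{path.stem}_clean{path.suffix}")


# Example: input file first, then the key=value patterns.
args = build_parser().parse_args(["paper.html", "-f", "a=b", "c=d"])
output = args.output_filename or default_output_name(args.filenames[0])
```

Because `-f` accepts any number of values, the input file name is given before `-f` in this example so that it is not swallowed as a pattern.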
- latexml_html_cleaner.main.main(args)
Wrapper allowing fib() to be called with string arguments in a CLI fashion
Instead of returning the value from fib(), it prints the result to the stdout in a nicely formatted message.
- Parameters:
args (List[str]) – command line parameters as list of strings (for example
["--verbose", "42"]).
- latexml_html_cleaner.main.parse_args(args)
Parse command line parameters
- Parameters:
args (List[str]) – command line parameters as list of strings (for example
["--help"]).
- Returns:
command line parameters namespace
- Return type:
- latexml_html_cleaner.main.parse_var(s)
Parse a key, value pair, separated by ‘=’. That’s the reverse of ShellArgs.
- On the command line (argparse) a declaration will typically look like:
foo=hello
- or
foo="hello world"
- latexml_html_cleaner.main.parse_vars(items)
Parse a series of key-value pairs and return a dictionary
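A minimal sketch of this kind of key=value parsing is shown below. It is not the package's source: the names `split_key_value` and `collect_key_values` are hypothetical stand-ins, and the choice to split only on the first ‘=’ and to reject items without one is an assumption.

```python
def split_key_value(item):
    """Split 'key=value' on the first '=' and return a (key, value) tuple."""
    key, sep, value = item.partition("=")
    if not sep:
        raise ValueError(f"expected key=value, got {item!r}")
    return key.strip(), value


def collect_key_values(items):
    """Turn a list of 'key=value' strings into a dictionary."""
    result = {}
    for item in items or []:
        key, value = split_key_value(item)
        result[key] = value
    return result


# The shell removes the quotes of foo="hello world" before argparse sees it.
patterns = collect_key_values(["foo=hello world", "url=a=b"])
```

Splitting on the first ‘=’ only keeps any later ‘=’ characters as part of the value, which matters when a replacement string itself contains one.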