HTML Only vs Webpage Complete files
The saved HTML differs depending on whether the page is saved with the "HTML Only" option or the "Webpage, Complete" option in Chrome and Firefox. I modified the regexes based on the differences I was finding in the HTML and made a pull request (!1 (merged)).
For Google files:
The parser is looking for the matching pattern /<base href="https:\/\/adssettings.google.com\/">/. The complete files have that tag elsewhere in the document, and depending on the browser it sometimes has the leading < and trailing > and sometimes doesn’t. I propose changing the regex to /base href="https:\/\/adssettings.google.com\/"/ to handle those differences.
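Here is a rough sketch of the difference between the two Google patterns. It assumes a translation to Python's `re` module (the parser's own language may differ), and the sample strings and the `matches` helper are made up for illustration, not taken from real saved pages.

```python
import re

# Python translations of the two patterns; the "\/" escapes are not needed here.
STRICT_GOOGLE = re.compile(r'<base href="https://adssettings.google.com/">')
RELAXED_GOOGLE = re.compile(r'base href="https://adssettings.google.com/"')

# Hypothetical samples (assumptions, not real browser output).
# One keeps the angle brackets, the other has them entity-escaped.
GOOGLE_WITH_BRACKETS = '<head><base href="https://adssettings.google.com/"><title>Ads</title>'
GOOGLE_WITHOUT_BRACKETS = '&lt;base href="https://adssettings.google.com/"&gt;'


def matches(pattern: re.Pattern, html: str) -> bool:
    """Return True if the pattern occurs anywhere in the saved HTML."""
    return pattern.search(html) is not None


if __name__ == "__main__":
    for label, sample in [("with brackets", GOOGLE_WITH_BRACKETS),
                          ("without brackets", GOOGLE_WITHOUT_BRACKETS)]:
        print(label,
              "strict:", matches(STRICT_GOOGLE, sample),
              "relaxed:", matches(RELAXED_GOOGLE, sample))
```

In this sketch the relaxed pattern accepts both samples, while the strict one only accepts the sample that keeps the angle brackets. Note that the `.` characters in the hostname are unescaped in both versions, so they match any character; that is unchanged by the proposal.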
For Facebook files:
The parser is looking for the matching pattern /<html lang="[^"]*" id="facebook" class="[^"]*">/. The complete files have a different ordering of the lang and id attributes. I think the most important part is the id="facebook", which confirms it is a Facebook file. I propose changing the regex to /<html .*id="facebook" class="[^"]*"/. That way it will match any characters before the id="facebook", but still match the general structure and ensure id="facebook" is there.
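The same kind of check for the Facebook patterns, again as a sketch translated to Python's `re` (an assumption about tooling). The sample `<html>` tags and the `is_facebook_file` helper are hypothetical and only illustrate attribute ordering.

```python
import re

# Python translations of the original and proposed patterns.
STRICT_FACEBOOK = re.compile(r'<html lang="[^"]*" id="facebook" class="[^"]*">')
RELAXED_FACEBOOK = re.compile(r'<html .*id="facebook" class="[^"]*"')

# Hypothetical opening tags (assumptions, not real browser output).
LANG_FIRST = '<html lang="en" id="facebook" class="no_js">'
ID_FIRST = '<html id="facebook" class="no_js" lang="en">'
CLASS_BEFORE_ID = '<html class="no_js" id="facebook" lang="en">'
NOT_FACEBOOK = '<html lang="en" id="other" class="no_js">'


def is_facebook_file(pattern: re.Pattern, html: str) -> bool:
    """Return True if the pattern occurs anywhere in the saved HTML."""
    return pattern.search(html) is not None


if __name__ == "__main__":
    for label, sample in [("lang first", LANG_FIRST),
                          ("id first", ID_FIRST),
                          ("class before id", CLASS_BEFORE_ID),
                          ("other id", NOT_FACEBOOK)]:
        print(label,
              "strict:", is_facebook_file(STRICT_FACEBOOK, sample),
              "relaxed:", is_facebook_file(RELAXED_FACEBOOK, sample))
```

In this sketch the relaxed pattern accepts both the lang-first and id-first orderings, but it still requires class to come directly after id="facebook", so a tag with class before id would not match. Whether that ordering ever occurs in real saved files is not something these samples show.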