how to get page source as HTML (not as XML) #443

@christian-draeger


Hello,
I am wondering whether it is possible to get the page source (after JavaScript has been executed) as HTML, the way a browser shows it when I inspect a page's source in its developer console.

I know that I can convert an HtmlPage to XML like this:

Given this HTML:

<div>
 <span id="dynamic"></span>
 <script>
     document.querySelector('#dynamic').innerHTML = "<span>dynamically added</span>";
 </script>
</div>

// Kotlin example: parse the HTML string into an HtmlPage object
val rendered: HtmlPage = WebClient(BrowserVersion.BEST_SUPPORTED).loadHtmlCodeIntoCurrentWindow(htmlFromAboveAsString)

// convert HtmlPage object to XML as string and print it to console
println(rendered.asXml())

This produces the following output:

<?xml version="1.0" encoding="UTF-8"?>
<html>
  <head/>
  <body>
    <div>
      <span id="dynamic">
        <span>
          dynamically added
        </span>
      </span>
      <script>
//<![CDATA[

        document.querySelector('#dynamic').innerHTML = "<span>dynamically added</span>";
    
//]]>
      </script>
    </div>
  </body>
</html>

But since this is XML (as the asXml() function promises), the string diverges from what a browser shows during DOM inspection.
Because asXml() is meant to produce valid XML, it adds a prolog on top that declares the XML version and the character encoding (<?xml version="1.0" encoding="UTF-8"?>). It also wraps the content of script tags in a CDATA block so that the script text does not clash with XML markup (in my example, the text contains <span>dynamically added</span>), and it possibly changes even more.
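
To make the two differences named above concrete, here is a minimal sketch of a naive string post-processing step. It is not part of HtmlUnit: the function name stripXmlArtifacts is hypothetical, and it assumes the prolog and the commented CDATA markers each sit on their own line, as in the output shown earlier. It does not undo other changes such as the self-closing <head/> or the extra whitespace around text.

```kotlin
// Hypothetical helper (not an HtmlUnit API): removes the XML prolog and the
// commented CDATA markers from a string by plain line filtering.
fun stripXmlArtifacts(xml: String): String =
    xml.lineSequence()
        // drop the prolog line, e.g. <?xml version="1.0" encoding="UTF-8"?>
        .filterNot { it.trimStart().startsWith("<?xml") }
        // drop the //<![CDATA[ and //]]> marker lines around script content
        .filterNot { it.trim() == "//<![CDATA[" || it.trim() == "//]]>" }
        .joinToString("\n")

fun main() {
    val xml = """
        <?xml version="1.0" encoding="UTF-8"?>
        <html>
          <script>
        //<![CDATA[
            var x = "<span>dynamically added</span>";
        //]]>
          </script>
        </html>
    """.trimIndent()
    println(stripXmlArtifacts(xml))
}
```

This is only a workaround on the string level; it still starts from the XML serialization and therefore cannot guarantee the same markup a browser would show.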

A real browser, on the other hand, shows the actual HTML after rendering in its developer console, like this:

<html>
    <head></head>
    <body>
        <div>
            <span id="dynamic"><span>dynamically added</span></span>
            <script>
                document.querySelector('#dynamic').innerHTML = "<span>dynamically added</span>";
            </script>
         </div>
    </body>
</html>

Actual Question:

Is it possible to get the rendered HTML as a string, instead of HTML that has been converted to XML?
