XPath in extract

We used to be able to use XPath to extract portions of a webpage into a file widget, using syntax like this:

<arg id="extract">//div[@class='page-body']</arg>

and that would pull the contents of

<div class="page-body"></div>

into the file widget. That doesn’t seem to be working any more, and I don’t see any related errors in the browser console. Has something changed - this used to work…?

Thanks,

-= G =-

Hi G,

Sorry to hear this is giving you trouble – I don’t see any recent changes to the getExtract function that handles the file widget’s arg=extract functionality. We’ll try to reproduce this on our servers and see what we find; apologies for the inconvenience in the meantime.

If it helps, this is the page we’re trying to pull from:

Thanks,

-= G =-

Thanks, it’s helpful to know that the page being pulled from is a 3rd-party (not LW-hosted) site, in case firewalls or other request-blockers are coming into play.

Right, I checked for errors in the browser console log and didn’t see anything like that, though…

-= G =-

All right, we’ve taken a deep dive on this – the long and short of it is that the (X)HTML of https://ogs.ny.gov/flags-new-york-state-buildings is malformed in a few ways, the sum total of which makes it impossible for the LiveWhale XML parser to locate the element the XPath points to.

If I extract just the inner bit and fix one orphaned </article> that was meant to be a </div>, it works in my isolated test case, but on the larger HTML page (c’mon, ny.gov!) there are too many parsing problems for LW’s scripted auto-repairs to grapple with.
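To illustrate the kind of failure involved, here is a minimal Python sketch. It is not LiveWhale code, and the markup is a made-up fragment rather than the real ny.gov page. It uses the strict XML parser in Python’s standard library, which (unlike LW) attempts no auto-repair, so a single stray </article> is enough to stop the parse. The extract helper is a hypothetical name, and ElementTree needs the relative form .//div rather than //div.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment, NOT the actual ny.gov markup: the <div> is closed
# by a stray </article>, mirroring the error described above.
BROKEN = """<body>
<div class="page-body"><p>Flags are at full staff.</p></article>
</body>"""

# The same fragment with the orphaned </article> corrected to </div>.
REPAIRED = BROKEN.replace("</article>", "</div>")

# ElementTree only accepts relative paths on an element, hence the leading dot.
XPATH = ".//div[@class='page-body']"


def extract(markup, xpath):
    """Return the first match serialized as text, or None on parse failure or no match."""
    try:
        root = ET.fromstring(markup)
    except ET.ParseError:
        # The strict parser gives up on the mismatched closing tag.
        return None
    node = root.find(xpath)
    if node is None:
        return None
    return ET.tostring(node, encoding="unicode").strip()


print(extract(BROKEN, XPATH))    # parse fails, so nothing is extracted
print(extract(REPAIRED, XPATH))  # the div is found once the tag is fixed
```

Note that both failure modes return the same empty result here, which is roughly why a broken source page can look like an extract that silently stopped working.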

In LW core, we’re going to add more descriptive failure messages for this situation (when an extract argument is present but the source HTML cannot be parsed) so it’s clearer to logged-in editors what’s going on. Apologies for the inconvenience!
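As a rough idea of the distinction that messaging would draw, the Python sketch below separates "the source could not be parsed" from "the source parsed, but the XPath matched nothing". This is only an illustration under assumptions: the function name and message wording are hypothetical, and it is not how LW core implements or will implement this.

```python
import xml.etree.ElementTree as ET


def extract_with_message(markup, xpath):
    """Return (content, message); content is None whenever extraction fails."""
    try:
        root = ET.fromstring(markup)
    except ET.ParseError as err:
        # Case 1: the source could not be parsed, so the XPath was never applied.
        return None, f"Source HTML could not be parsed ({err}); extract '{xpath}' was not applied."
    node = root.find(xpath)
    if node is None:
        # Case 2: the source parsed cleanly, but nothing matched the XPath.
        return None, f"Source parsed, but extract '{xpath}' matched no element."
    return ET.tostring(node, encoding="unicode").strip(), "ok"


# Usage with a made-up malformed fragment (stray </article>):
content, message = extract_with_message(
    "<body><div class='page-body'>hi</article></body>",
    ".//div[@class='page-body']",
)
print(message)
```

With a message like the first one, an editor can tell that the problem lies in the source page rather than in the XPath itself.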

Karl:

Thanks so much for taking so much time with it and for the explanation. The additional messaging will be helpful in those cases where something like that happens, but in this case it seems like it is not a LW issue, so we’ll probably just have to let it go…

It was working ok at one point quite a while ago, so I wonder whether something changed on the source page…

Thanks,

-= G =-