Home » Import and query HTML table data in SQL Server

Import and query HTML table data in SQL Server

by Vlad Drumea
0 comment

In this post I cover a way to import and query HTML table data in SQL Server using only T-SQL and no external tools.

Why?

For science!

But I also wanted to see if I can use T-SQL to parse some of the HTML report pages generated by PSBlitz.

In this case, I’m focusing on the index diagnosis page generated by running the non-stored procedure version of Brent Ozar’s sp_BlitzIndex that’s part of PSBlitz.

Getting things ready

There’s no initial setup part required here, all that’s needed is a SQL Server instance, and an HTML, containing at least one table, located in a directory where the SQL Server service account has read access.

Since I’m lazy efficient, I’ve just copied the BlitzIndex_4.html file, from one of my test runs of PSBlitz, to my instance’s backup directory.

If you want to do this from another directory, which I recommend for production and/or shared environments, you can use icacls to grant the service account read permissions on the file.

Don’t do silly stuff like adding the service account to the local Admins group or setting SQL Server to run as LocalSystem – see here why.

If you want to repeat the steps with the same data and file, you can find the same file I’ve used here.

For reference, this is what the first rows of the HTML file look like when viewed in the browser.

The first 7 rows plus headers of the Detailed Index Diagnosis page of a diagnostics report generated by PSBlitz SEO import query HTML SQL Server

Importing and querying HTML

And this is how the result set looks like. Just like what you’d see after running sp_BlitzIndex from SSMS.

Result set from SSMS showing the data in a tabular format like it would have originated from a table from from sp_BlitzIndex's output SEO import query HTML SQL Server

Breaking it down

I’ll skip the variable declaration since that’s self-explanatory.

The SELECT statement does the following:

  • Uses OPENROWSET combined with BULK to read the E:\VSQL\Backup\BlitzIndex_4.html file, and SINGLE_BLOB to treat it as VARBINARY and avoid any potential encoding issues (as opposed to SIGLE_CLOB or SINGLE_NCLOB which can conflict with some Windows encodings and cause issues on file import)
  • Applies REPLACE on the BulkColumn column returned by OPENROWSET to
    • Remove non-XML tags – in this case the link at the end of the file <a href="#top">Jump to top</a> that, when clicked in the browser, takes you back to the top of the report page; and the line break tag <br> which does not have a closing tag that XML would require
    • Fills empty table fields with a dummy value, in this case x, so that the end result won’t have values shifted to the left for records where SecretColumn, or other columns, would normally be empty.
  • And, finally, it converts the data to XML, but since HTML isn’t real XML, I use XML style 2 so that SQL Server does a bit of work to make it look like legit XML content.

Now for the XML querying part

I’ll be hones with you, dear reader, I suck at XML querying and I have to google stuff the same as everyone else.
So, I’ve used Alan Burstein’s reply from this SQLServerCentral forum post as a starting point.

I’ll do my best to explain what’s happening here:

  • Since the HTML table columns don’t really have names, but ordinal positions, I have to use their position in the table, so ./td/text())[1] represents column 1, in this case the Priority column
  • For now, xx is a dummy column name from the derived table t2
  • The value() method performs an XQuery (XML query) for every column specified and returns the vale as the specified SQL data type – in the case of xx.value('(./td/text())[1]', 'INT') it returns the text from the first <td> element as an INTEGER
  • FROM (VALUES(@XMLContent)) AS t1(x) creates a derived table with one column (x) and one row containing the XML content from the @XMLContent variable
  • CROSS APPLY x.nodes('//table[1]/tr[position()>1]') AS t2(xx)) – This line uses the CROSS APPLY operator along with the .nodes() method to parse the XML content and retrieve data from it.
    • It looks for the first <table> element in the HTML and then selects all <tr> elements (rows) that have a position greater than 1 (skipping the header row)
    • For each selected <tr> element, it applies the alias t2 and further extracts the data inside the <td> elements of that row using the xx.value(…) method as described in the third bullet point
  • The XMLToTableCTE common table expression that wraps around the XML query is there just to make filtering or any manipulation of the returned data easier

In summary, the query parses the provided XML content obtained from converting the HTML file, extracts data from the first table (excluding the header row), and presents the data in the SSMS result grid that we all know and love.

Conclusion

If T-SQL is your hammer of choice, then every file is a potential nail if it can be converted to usable XML.

You may also like

Leave a Comment

* By using this form you agree with the storage and handling of your data by this website.

This site uses Akismet to reduce spam. Learn how your comment data is processed.