The Importance of Semantic HTML

by

The vast majority of modern web developers would agree that semantic HTML is good HTML. HTML documents should have meaning, instead of being vehicles of presentation. In the past, web developers were more concerned with the appearance of the web site rather than its meaning. However, this philosophy led to many problems, especially with accessibility. Screen readers and HTML parsers have a hard time making any sense of non-semantic HTML. Putting presentation first usually leads to the abuse of HTML tags. <table> is meant to organize data, not dictate layout for a site. HTML now provides semantically neutral elements like <div> for layout styles. Using tags incorrectly confuses screen readers and HTML parsers, which makes the page much less accessible to users. Today, style information is encapsulated in a CSS document. This allows a developer to apply a simple modification like a color change to be propagated across hundreds of pages automatically. Stylesheets also contribute to a pages meaning. For example, it is obvious which HTML is semantic:

<i color="red">Don't do this!</i>

or

<em class="warning">Don't do this!</em>

A web designer can accomplish the same presentation as the former HTML with this style:

.warning {
    font-style: italic;
    color: red;
}

At work this past week I was writing a script to scrape some biological data returned from a website query. The script supplies the site with a small input, then counts the number of genes in a table on the page. Sounds easy enough. However, this website was an absolute nightmare from a developer's perspective!

The design of the page itself was pretty bare-bones. It had a header image, a few navigation links, a table that held the relevant data, and a footer with a few more links. However, this data was extremely annoying to parse. The page used a stylesheet, but the classes were used solely for decoration. For example, the table's rows were marked "odd" and "even" to alternate the colors of each row. Not a crime in itself, I thought that I could merely count the occurences of the "odd" and "even" classes on the page. This wasn't possible, because there were multiple invisible table rows that were still named "odd" and "even!"

I decided to try counting the number of genes by count the number of links in the table. There were two problems with this approach. First of all, the site used tables for its layout, so it was difficult to know if I was even in the correct table. Luckily the relevant table had its own class "restable." I counted my blessings that I would not have to work around the first issue. The second issue was that there was two links in every table row, one for the gene's name and a link to its sites in the untranslated region of its sequence. I couldn't directly divide by two because sometimes the links did not appear (of course). I simply refused to count the link if its text was equal to "Sites in UTR."

I thought that was the end of the confusion. To make things even worse, sometimes the site would not return a table at all! Instead, the table was replaced with a link to the page that I actually wanted to parse. I couldn't believe how this page operated! Parsing the site with the HTML parser included in the Python standard library was a nightmare. A summary of my script's logic in Python pseudocode:

if correct_table_is_present:
    if in_correct_table:
        if in_table_row:
            if tag_is_link:
                if link_text != 'Sites in UTR':
                    gene_count += 1

Incredibly hacky and ugly, but it got the job done. Just goes to show that most biologists aren't programmers ;). Thanks for reading, and I hope this tale of woe inspires you to write the right kind of HTML!