It’s quite common for a web page to be broken into sections with items appearing in each section.
The best approach I’ve found is to first break the page into sections, and then break those sections into items. So let’s have a look at that process as seen in this pipe.
As we can see, the kind developer of this page has made it nice and easy for us to split the page into sections.
The first rule extracts the section name from the content and wraps it in a pair of div tags. The div has a class attribute whose value shouldn’t appear anywhere else on the page. So in item.section we now have something like “<div class="mysection">Sports</div>”. Next we need to find some portion of the html that occurs at least once for each item (and preferably only once), and that is exactly the same for each item. In this case we can identify “<td class="subject"” as a suitable target. So the second rule takes that string and globally prefixes it with the value in item.section. We now have the section div, with the text of the section inside it, appearing in each item. Time to break out the items.
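For anyone who prefers to read the idea as code, here is a rough Python sketch of those two rules. It is not what the pipe itself runs. The sample html, the assumption that the section name sits in an h2 tag, and the function name tag_items_with_section are all made up for illustration.

```python
import re

# Hypothetical sample of one section of the page (not taken from the real site).
SECTION_HTML = (
    '<h2>Sports</h2>'
    '<table>'
    '<tr><td class="subject">Football results</td></tr>'
    '<tr><td class="subject">Tennis draw</td></tr>'
    '</table>'
)

# The string that occurs exactly once per item.
ITEM_MARKER = '<td class="subject"'


def tag_items_with_section(content):
    """Return (section_div, content with the div prefixed to every item)."""
    # Rule 1: pull out the section name (assumed here to be in an <h2>)
    # and wrap it in a div whose class appears nowhere else on the page.
    match = re.search(r'<h2>(.*?)</h2>', content, re.S)
    if match is None:
        raise ValueError('no section heading found')
    section = '<div class="mysection">{}</div>'.format(match.group(1).strip())

    # Rule 2: globally prefix the item marker with the section div.
    tagged = content.replace(ITEM_MARKER, section + ITEM_MARKER)
    return section, tagged


section, tagged = tag_items_with_section(SECTION_HTML)
print(section)
print(tagged)
```

The unique class is what makes the later clean-up safe, since nothing else on the page can be mistaken for the inserted div.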
Nothing out of the ordinary here. Use a String Tokenizer to split up the items, and a Filter to get rid of any unwanted items. In this case the heading is going to be saved as a category element, so a Rename creates that element.
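The split, filter and rename steps could be sketched in Python along these lines. The delimiter “<tr”, the dictionary keys and the function name split_items are my own assumptions, and I am assuming the Rename copies the item content into category so that there is something for the next rule to work on.

```python
ITEM_MARKER = '<td class="subject"'
SECTION_DIV = '<div class="mysection">Sports</div>'

# Hypothetical section content after the div has been prefixed to each item.
TAGGED = (
    '<h2>Sports</h2><table>'
    '<tr>' + SECTION_DIV + '<td class="subject">Football results</td></tr>'
    '<tr>' + SECTION_DIV + '<td class="subject">Tennis draw</td></tr>'
    '</table>'
)


def split_items(tagged, delimiter='<tr'):
    """Split the section into items and drop tokens that are not items."""
    items = []
    for token in tagged.split(delimiter):      # String Tokenizer
        if ITEM_MARKER not in token:           # Filter: discard unwanted tokens
            continue
        # Rename (assumed to copy): category starts out as the whole item.
        items.append({'content': token, 'category': token})
    return items


items = split_items(TAGGED)
for item in items:
    print(item['content'])
```

With the sample above, the token before the first row holds only the heading and the opening table tag, so the filter drops it and two items remain.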
We don’t want to be left with just the text of the heading at this stage, because we have some housekeeping to do, so the Regex rule leaves us with what we originally had in the section element, e.g. “<div class="mysection">Sports</div>”.
Finally, as far as this post is concerned, we can use a String Replace to remove the div that was inserted, and a Regex to leave us with just the text of the heading. The housekeeping might not be strictly necessary, but it could make life easier later on, since we can refer to the original content of the web page without having to remember that somewhere in there is some added markup.
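As a rough Python equivalent of the housekeeping (the Regex that keeps the div, the String Replace, and the final Regex), something like the following would do. The item layout and the name tidy_item are hypothetical, and the patterns are only one way of writing them.

```python
import re

# Matches the div that was inserted earlier; the class is unique on the page.
SECTION_DIV_RE = re.compile(r'<div class="mysection">.*?</div>', re.S)


def tidy_item(item):
    """Remove the inserted div from the content and reduce category to text."""
    # Regex: cut category down to the inserted div, markup included.
    match = SECTION_DIV_RE.search(item['category'])
    if match is None:
        raise ValueError('item has no section div')
    div = match.group(0)

    # String Replace: take the inserted div back out of the item content.
    content = item['content'].replace(div, '')

    # Regex: strip the wrapper tags so only the heading text is left.
    category = re.sub(r'</?div[^>]*>', '', div).strip()
    return {'content': content, 'category': category}


# Hypothetical item as it might look straight after the split.
raw = '><div class="mysection">Sports</div><td class="subject">Football results</td></tr>'
tidy = tidy_item({'content': raw, 'category': raw})
print(tidy['category'])
print(tidy['content'])
```

Keeping the whole div in category until the replace has run is the point of doing it in this order: the exact string to remove is already sitting in the item.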