Archive for May, 2009

Web page section headers in items

May 21, 2009

It’s quite common for a web page to be broken into sections with items appearing in each section.

The best way that I’ve found to approach this situation is to first break the page into sections, and then break those sections into items. So let’s have a look at that process as seen in this pipe.

As we can see, the kind developer of this page has made it nice and easy for us to split the page into sections.

The first rule extracts the section name from the content and wraps it in a pair of div tags. The div has a class attribute whose value should not appear anywhere else on the page. So in item.section we now have something like “<div class="mysection">Sports</div>”. Next we need to find some portion of the html that occurs at least once for each item (and preferably only once). The html has to be exactly the same for each item. In this case we can identify “<td class="subject"” as a suitable target. So the second rule takes that string and globally prefixes it with the value in item.section. We now have the text of the section appearing in each item. Time to break out the items.
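For anyone who finds it easier to read code than module wiring, here is a rough Python sketch of what those two rules do. It is not the pipe itself. The `<h2>` section header is purely an assumption for illustration (the real page's header markup isn't shown here), and the function name is made up.

```python
import re

# The string that occurs exactly once per item on the page (from the post).
ITEM_MARKER = '<td class="subject"'


def tag_items_with_section(section_html: str) -> str:
    """Prefix every item in one section with a div holding the section name."""
    # Rule 1: extract the section name and wrap it in a uniquely classed div.
    # ASSUMPTION: the section name sits in an <h2> tag; adjust for the real page.
    match = re.search(r"<h2>(.*?)</h2>", section_html, flags=re.S)
    name = match.group(1).strip() if match else ""
    section_div = '<div class="mysection">' + name + "</div>"

    # Rule 2: globally prefix the per-item marker with the section div.
    return section_html.replace(ITEM_MARKER, section_div + ITEM_MARKER)


if __name__ == "__main__":
    page = (
        "<h2>Sports</h2><table>"
        '<tr><td class="subject">Match report</td></tr>'
        '<tr><td class="subject">Transfer news</td></tr>'
        "</table>"
    )
    print(tag_items_with_section(page))
```

The class value only has to be unique on the page, so that the inserted div can be found and removed again later.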

Nothing out of the ordinary here. Use a String Tokenizer to split up the items, and a Filter to get rid of any unwanted items. In this case the heading is going to be saved as a category element. So a Rename creates the category element.

We don’t want to be left with just the text of the heading at this stage, because we have some housekeeping to do, so the Regex rule leaves us with what we originally had in the section element, e.g. “<div class="mysection">Sports</div>”.

Finally, as far as this post is concerned, we can use a String Replace to remove the div that was inserted, and a Regex to leave us with just the text of the heading. The housekeeping might not be strictly necessary, but it could make life easier for later on, since we can refer to the original content of the web page without having to remember that somewhere in there is some added markup.
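The splitting and housekeeping steps can be sketched in Python along the following lines. This collapses the String Tokenizer, Filter, Rename, Regex and String Replace into one small function, so it only approximates the pipe. Splitting on the opening div tag is my assumption about a convenient delimiter, and the `category` and `content` key names are just illustrative.

```python
DIV_OPEN = '<div class="mysection">'
DIV_CLOSE = "</div>"


def split_items(tagged_html: str) -> list:
    """Split section-tagged html into items, each carrying its category."""
    items = []
    # Tokenize on the inserted div; the first chunk is whatever came before
    # the first item, so it is dropped (the Filter step).
    for chunk in tagged_html.split(DIV_OPEN)[1:]:
        # Everything up to the closing tag is the heading text; the rest is
        # the item's original html with the added markup gone (housekeeping).
        name, _, content = chunk.partition(DIV_CLOSE)
        items.append({"category": name, "content": content})
    return items


if __name__ == "__main__":
    tagged = (
        "<table>"
        '<div class="mysection">Sports</div><td class="subject">Match report</td>'
        '<div class="mysection">Sports</div><td class="subject">Transfer news</td>'
    )
    for item in split_items(tagged):
        print(item)
```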

Concatenating items in a feed

May 20, 2009

By which I mean lumping selected elements from all the items in a feed into a single item.

A word of warning first. This technique involves using a string of numbers in the form “0,1,2,3”, where the highest number is one less than the maximum number of items possible in the feed. If you’re looking to lump together 50 items and aren’t willing to build the relevant string then this is not for you.

Let’s have a look at my pipe that does this.

Here we have a URL Input module, an Item Builder, a Fetch Data module in a Loop and a Split. The main thing to notice here is that in the normal run of things I could just have used a URL Input module and a Fetch Data, but what I want is for my items to appear as a sub-element array.

If you want to have more than one element in the output, or you want to format the elements, then use a sub-pipe instead of my Fetch Data module to do that work.

Now let’s see what’s happening in the right-hand path of the Split.

We need the Sub-element so that we can then use the Count to output the number of items in the feed. If we have 2 items then the String Regex converts that number to “,2.*”. This is where that string of numbers I mentioned earlier comes in.

The first rule in this String Regex removes “,2” and everything after that, and the second rule wraps what is left inside “${stuff.” and “.title}”. Finally the String Replace replaces each of the commas with the string on the right. The outcome of all this is that we end up with a string like “${stuff.0.title}<br>${stuff.1.title}”. In case you weren’t aware, in the Regex module we can refer to elements of an item using ${element path excluding item.}. And now we can plug this into the Regex module in the left-hand path of the Split.
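To make the string juggling concrete, here is a minimal Python sketch of the same three steps. The function and parameter names are hypothetical, and the index string is deliberately short. As in the pipe, it only behaves when the item count is at least 1 and actually appears in the index string.

```python
import re

# The hand-built string of indexes; extend it to cover the largest feed expected.
INDEX_STRING = "0,1,2,3,4,5"


def build_template(count, index_string=INDEX_STRING, element="title", separator="<br>"):
    """Build a '${stuff.N.element}' reference for each of the first `count` items."""
    # Rule 1: remove ",<count>" and everything after it, e.g. "0,1" for count=2.
    kept = re.sub("," + str(count) + ".*", "", index_string)
    # Rule 2: wrap what is left, e.g. "${stuff.0,1.title}".
    wrapped = "${stuff." + kept + "." + element + "}"
    # String Replace: turn each comma into the close of one reference,
    # the separator, and the opening of the next reference.
    return wrapped.replace(",", "." + element + "}" + separator + "${stuff.")


if __name__ == "__main__":
    print(build_template(2))
```

Called with a count of 2, this gives the same “${stuff.0.title}<br>${stuff.1.title}” string described above.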

The grand result of this is that I can show off the first 2 titles of my blog posts.

I think that could be 3 titles fairly soon.

How long is my (piece of) string?

May 11, 2009

There’s no built-in module in Pipes that can count the number of characters in a string/element, and in the past I hadn’t been able to come up with a way to do this. But recently I hit on this method, which may be the strangest way you’ll ever see this function implemented.

In the right-hand path of the Split module the Regex module replaces each character with “1,”. There is no significance to the two characters used.

The String Tokenizer splits the string into separate items using the second character of the replacement string, the comma in this case. All that is left then is to get a count of the elements and insert that figure into the Item Builder that is in the path on the left.

This pipe needs to be used as a sub-pipe. If the routine is used in a main pipe the String Tokenizer will produce a set of items based on the sum of the strings for all of the main items. Because it runs as a sub-pipe, there is a limit of roughly 2,000 characters for the string.
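The same roundabout counting trick can be sketched in a few lines of Python. Obviously `len()` does the job directly, so this is only here to show the mechanics. Dropping the empty token left after the final comma is my assumption about what the tokenizing step needs to do for the count to come out right.

```python
import re


def string_length(text: str) -> int:
    """Count characters by marking each one and counting the resulting tokens."""
    # Regex step: replace every character (newlines included) with "1,".
    marked = re.sub(r".", "1,", text, flags=re.S)
    # Tokenizer step: split on the comma and ignore the empty trailing token.
    tokens = [token for token in marked.split(",") if token]
    # Count step: one token per original character.
    return len(tokens)


if __name__ == "__main__":
    sample = "How long is my (piece of) string?"
    print(string_length(sample), len(sample))
```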