The ability to harvest data from web pages using the Yahoo! Query Language (YQL) Web Service is nothing short of inspired. If you have not taken YQL for a test drive, I would highly recommend you carve out an hour to play with it.

When building your YQL statement, you will need to determine the correct XPath expression in order to extract the data from the web page of interest. Using Firebug to copy the XPath of an element on the page makes this much simpler. However, there is one caveat! You need to remove the tbody elements from the XPath expression, as they are not recognized by YQL. Browsers typically insert tbody into a table's DOM even when the page's HTML source does not contain it, which is most likely why Firebug's XPath includes steps that YQL cannot match.

For example, let's say that you want to harvest the Florida Gators football schedule from rivals.com.
  1. Navigate to the rivals.com web page, and open Firebug.

  2. Using the Firebug element inspection tool, select the first row of the football schedule.


  3. In the Firebug source window, right-click the highlighted first table row and select Copy XPath from the menu.

    /html/body/div/table/tbody/tr/td/table[3]/tbody/tr[2]
    
  4. With both the web page URL and the XPath expression, you can build the YQL statement:
    select * from html
    where url='http://rivals.yahoo.com/ncaa/football/teams/ffa/schedule'
    and xpath='/html/body/div/table/tbody/tr/td/table[3]/tbody/tr[2]'
    
    However, when you enter this into the YQL Console, the results are null.

  5. The fix is to remove the two tbody elements from the XPath expression:
    '/html/body/div/table/tr/td/table[3]/tr[2]'
    
  6. Paste the updated YQL statement into the YQL Console and you have the desired results.
    select * from html
    where url='http://rivals.yahoo.com/ncaa/football/teams/ffa/schedule'
    and xpath='/html/body/div/table/tr/td/table[3]/tr[2]'
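
The tbody cleanup in the steps above is mechanical, so it can be scripted. Here is a minimal Python sketch of that idea; it is not part of the original walkthrough. The helper names `strip_tbody` and `build_yql` are hypothetical, and the sketch assumes tbody only ever appears as a plain path step, optionally with a numeric index such as `tbody[1]`.

```python
import re


def strip_tbody(xpath: str) -> str:
    """Remove every tbody step (with an optional [n] index) from an XPath."""
    # Match "/tbody" or "/tbody[n]" only when it is a whole path step,
    # i.e. followed by another "/" or by the end of the expression.
    return re.sub(r"/tbody(\[\d+\])?(?=/|$)", "", xpath)


def build_yql(url: str, xpath: str) -> str:
    """Build a YQL statement against the html table (hypothetical helper)."""
    return f"select * from html where url='{url}' and xpath='{strip_tbody(xpath)}'"


# XPath exactly as copied from Firebug in step 3.
firebug_xpath = "/html/body/div/table/tbody/tr/td/table[3]/tbody/tr[2]"

print(strip_tbody(firebug_xpath))
# -> /html/body/div/table/tr/td/table[3]/tr[2]

print(build_yql("http://rivals.yahoo.com/ncaa/football/teams/ffa/schedule", firebug_xpath))
```

The lookahead keeps the pattern from touching an element whose name merely starts with "tbody", and the resulting statement can be pasted into the YQL Console as in step 6.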
    

Enjoy, and happy hackin'!