House of Fusion

ColdFusion Talk (CF-Talk) mailing list

scraping meta tags

Author:
Richard Steele
03/18/2010 06:50 AM

We'd like to pull meta tags from the home pages of various websites. Here's how we'd like this to work:

1. Query a SQL table listing over 3,000 URLs.
2. Pull the meta tags and description from the home page of each of these websites.
3. Insert these meta tags into a database.

What's the best way to accomplish this? In particular, how do we scrape the meta tags using CF8? Thanks in advance.

Author:
Paul Vernon
03/18/2010 07:02 AM

> What's the best way to accomplish this? In particular, how do we scrape
> the meta tags using CF8?

Here's how I'd approach it.

1. Write a spider using CFHTTP at its core.
2. Write a parser that can parse cfhttp.filecontent, looking for the meta tags you're interested in.
3. Create a loop that uses cfthread to call your spider and parse each response with the parser, say 10 threads per pass through the loop (more if you have Enterprise).
4. Stick the parsed data in the DB and loop round to the next batch from the list.

Paul
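The batching idea in steps 3 and 4 can be sketched outside CFML as well. The following is a minimal Python illustration, not the CFHTTP/cfthread code the post describes: a thread pool stands in for cfthread, and the names `fetch_html`, `batches`, and `crawl` are hypothetical. It fetches one batch of 10 URLs at a time and records a failed fetch as `None` so that a single dead site does not stop the run.

```python
from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlopen


def fetch_html(url, timeout=10):
    """Fetch one page and return its body as text (undecodable bytes are replaced)."""
    with urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


def batches(items, size):
    """Yield consecutive slices of `items`, each at most `size` long."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def crawl(urls, fetch=fetch_html, batch_size=10):
    """Fetch URLs one batch at a time; return {url: html, or None on failure}."""
    results = {}
    for batch in batches(urls, batch_size):
        with ThreadPoolExecutor(max_workers=batch_size) as pool:
            futures = {url: pool.submit(fetch, url) for url in batch}
        # Leaving the `with` block waits for every fetch in this batch.
        for url, future in futures.items():
            try:
                results[url] = future.result()
            except Exception:
                results[url] = None  # a dead site must not stop the run
    return results
```

The `fetch` parameter is injectable so the loop can be tried against a stub before it is pointed at 3,000 real sites. Parsing and the database insert would happen per batch, after `crawl` returns.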

Author:
Dave Watts
03/18/2010 08:52 PM

The best way to accomplish this would probably be to use something other than CF, which is not intended for this kind of thing. There are all sorts of products, free and other, that can do individual parts of this, without being tied to the request/response model that CF is designed to work within.

If I had to do this, I think I'd use Python to query the database for your list of URLs and write them to a file, then pass that file to wget to fetch the URLs, then use Python again to parse the metadata from the fetched URLs and write that to the database.

Dave Watts, CTO, Fig Leaf Software
http://www.figleaf.com/
http://training.figleaf.com/

Fig Leaf Software is a Veteran-Owned Small Business (VOSB) on GSA Schedule, and provides the highest caliber vendor-authorized instruction at our training centers, online, or onsite.
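The "use Python again to parse the metadata" step might look something like the sketch below. It is only an illustration using the standard library's `html.parser`; the class name `MetaTagParser` and the helper `parse_meta` are hypothetical, and it assumes each fetched page is already available as a string. It collects `<meta name=... content=...>` pairs (lower-casing the names) plus the page title.

```python
from html.parser import HTMLParser


class MetaTagParser(HTMLParser):
    """Collect <meta name=... content=...> pairs and the <title> text."""

    def __init__(self):
        super().__init__()
        self.meta = {}
        self.title = ""
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "meta":
            attributes = dict(attrs)
            # Some sites use property= (e.g. Open Graph) instead of name=.
            key = attributes.get("name") or attributes.get("property")
            content = attributes.get("content")
            if key and content is not None:
                self.meta[key.lower()] = content.strip()

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def parse_meta(html):
    """Return (title, {meta name: content}) for one HTML document."""
    parser = MetaTagParser()
    parser.feed(html)
    parser.close()
    return parser.title.strip(), parser.meta
```

For example, `parse_meta(page)[1].get("description")` gives the description if the page declares one, and `None` otherwise. Real home pages are often malformed, so a production run would probably want a more forgiving HTML parser than this one.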

Author:
denstar
03/18/2010 11:27 PM

I'm really fond of web-harvest: http://web-harvest.sourceforge.net/

It's the cat's meow. Even wrote a custom tag (Railo only ATM) for using it with CFML.

:den

--
True time is four-dimensional.
Martin Heidegger

