ColdFusion Talk (CF-Talk)
scraping meta tags
Richard Steele 03/18/10 06:50 A
Paul Vernon 03/18/10 07:02 A
Dave Watts 03/18/10 08:52 P
denstar 03/18/10 11:27 P

Richard Steele wrote:

We'd like to pull meta tags from the home page of various websites. Here's how we'd like this to work:

1. A SQL table listing over 3,000 URLs is queried.
2. Pull the meta tags and description from the home page of each of these websites.
3. Insert these meta tags into a database.

What's the best way to accomplish this? In particular, how do we scrape the meta tags using CF8?

Thanks in advance.

Paul Vernon wrote:

> What's the best way to accomplish this? In particular, how do we scrape
> the meta tags using CF8?

Here's how I'd approach it.

1. Write a spider using CFHTTP at its core.
2. Write a parser that can parse cfhttp.filecontent, looking for the meta tags you're interested in.
3. Create a loop that uses cfthread to call your spider and parse the response using the parser, say 10 times per loop. (More if you have Enterprise.)
4. Stick the parsed data in the DB and loop round to the next load from the list.

Paul

Dave Watts wrote:

----- Excess quoted text cut - see Original Post for more -----

The best way to accomplish this would probably be to use something other than CF, which is not intended for this kind of thing. There are all sorts of products, free and otherwise, that can do individual parts of this without being tied to the request/response model that CF is designed to work within.

If I had to do this, I think I'd use Python to query the database for your list of URLs and write them to a file, then pass that file to wget to fetch the URLs, then use Python again to parse the metadata from the fetched pages and write that to the database.
Dave Watts, CTO, Fig Leaf Software
http://www.figleaf.com/
http://training.figleaf.com/

Fig Leaf Software is a Veteran-Owned Small Business (VOSB) on GSA Schedule, and provides the highest caliber vendor-authorized instruction at our training centers, online, or onsite.

denstar wrote:

On Thu, Mar 18, 2010 at 6:51 PM, Dave Watts wrote:
----- Excess quoted text cut - see Original Post for more -----

I'm really fond of web-harvest: http://web-harvest.sourceforge.net/

It's the cat's meow. Even wrote a custom tag (Railo only ATM) for using it with CFML.

:den

--
True time is four-dimensional.
Martin Heidegger
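
For anyone landing on this thread later: the fetch-and-parse step that the replies above describe could look roughly like the following in Python, the language Dave suggested. This is only a sketch under a few assumptions, not anyone's actual implementation. The names MetaTagParser, extract_meta, fetch_meta and scrape_all are made up for illustration, pages are assumed to decode as UTF-8, the pool of 10 workers just mirrors Paul's "10 times per loop" figure, and the database read and write are left out.

```python
from concurrent.futures import ThreadPoolExecutor
from html.parser import HTMLParser
from urllib.request import urlopen


class MetaTagParser(HTMLParser):
    """Collects <meta name=... content=...> pairs from a page."""

    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        if tag != "meta":
            return
        attrs = dict(attrs)
        # name= covers description/keywords; property= covers og:* tags.
        key = attrs.get("name") or attrs.get("property")
        if key and attrs.get("content") is not None:
            self.meta[key.lower()] = attrs["content"]


def extract_meta(html):
    """Return a dict of meta tag name -> content for one HTML document."""
    parser = MetaTagParser()
    parser.feed(html)
    return parser.meta


def fetch_meta(url, timeout=10):
    """Fetch one page and parse its meta tags (assumes UTF-8 content)."""
    with urlopen(url, timeout=timeout) as resp:
        html = resp.read().decode("utf-8", errors="replace")
    return extract_meta(html)


def scrape_all(urls, workers=10):
    """Fetch many URLs concurrently; a failed URL maps to None."""

    def safe_fetch(url):
        try:
            return url, fetch_meta(url)
        except Exception:  # dead sites, timeouts, malformed URLs
            return url, None

    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(pool.map(safe_fetch, urls))
```

Calling scrape_all with the list of URLs from the table gives back a dict keyed by URL, so each entry (for example its "description" and "keywords" values, when the page has them) can be written to the database in a second pass. Failed sites come back as None rather than stopping the run, which matters when the list has 3,000 entries.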
July 31, 2010