House of Fusion
Search over 2,500 ColdFusion resources here
  
Home of the ColdFusion Community

Mailing Lists
Home /  Groups /  ColdFusion Talk (CF-Talk)

Big XML files processing Really s-l-o-w Solution?

  << Previous Post |  RSS |  Tree View |  Sort Oldest First |  Subscribe to this Group Next >> 

Big XML files processing Really s-l-o-w. Solution?

Matt, Mark A. Kruger 02/19/2011 10:45 PM
Nice... Rick Faircloth 02/19/2011 09:32 PM
Nice report!  I'll try and remember that the next time I have to parse Matt Quackenbush 02/19/2011 09:21 PM
Here's the update: Matt Robertson 02/19/2011 09:06 PM
On Thu, Feb 17, 2011 at 8:56 AM, Rick Faircloth wrote: Matt Robertson 02/17/2011 10:51 PM
Good to know! Rick Faircloth 02/17/2011 11:56 AM
WE've used this approach as well... seems like a kludge but it actually Mark A. Kruger 02/17/2011 11:03 AM
In our company I found a solution to parse big XML files in ColdFusion Marco Antonio C. Santos 02/17/2011 10:54 AM
No problem... I haven't done much with XML, myself, so I thought Rick Faircloth 02/17/2011 10:31 AM
> Perhaps this will be of some help. Matt Robertson 02/17/2011 09:43 AM
Hi, Matt... Rick Faircloth 02/16/2011 11:09 PM
> Would it help if you split the file into individual records before Matt Robertson 02/16/2011 10:18 PM
> I am tasked with doing an import on a file whose size can essentially Jochem van Dieten 02/16/2011 06:49 PM
Yeah I have seen that, but I'm up against a time constraint and there Matt Robertson 02/16/2011 02:05 PM
Matt, as you want a non CF solution, you might be best posting elsewhere to Russ Michaels 02/16/2011 02:05 PM
Hi, Matt... Rick Faircloth 02/16/2011 01:10 PM
Hey everybody, we're in a bind here and I pitched the idea to the Matt Robertson 02/16/2011 12:22 PM
Very interesting stuff, guys.  Very much appreciated.  Matt Robertson 02/14/2011 07:52 PM
That is probably the solution I was thinking of too, I expect Ben or someone Russ Michaels 02/14/2011 07:47 PM
That actually looks like a pretty nice solution if bulkinsert isn't an Michael Grant 02/14/2011 07:33 PM
Processing the XML with Java rather than CFML will speed things up. Russ Michaels 02/14/2011 07:30 PM
When in doubt, ask Ben Nadel(or check his blog)... Words to live by Alan Rother 02/14/2011 07:26 PM
On Mon, Feb 14, 2011 at 7:07 PM, Matt Robertson Casey Dougall 02/14/2011 07:16 PM
> Either that, or start looking at something like StaX - Matt Robertson 02/14/2011 07:09 PM
> > you can make use of BulkInsert Michael Grant 02/14/2011 07:09 PM
Client is running mySQL, But a SQL Server could probably be arranged Matt Robertson 02/14/2011 07:07 PM
----- Excess quoted text cut - see Original Post for more ----- Casey Dougall 02/14/2011 07:03 PM
Either that, or start looking at something like StaX - Mark Mandel 02/14/2011 06:27 PM
On Mon, Feb 14, 2011 at 5:44 PM, Matt Robertson Casey Dougall 02/14/2011 06:25 PM
My question is how do I reduce processing time... is there anything I Matt Robertson 02/14/2011 05:45 PM

02/19/2011 10:45 PM
Author: Mark A. Kruger Short Link: http://www.houseoffusion.com/groups/cf-talk/thread.cfm/threadid:62635#342453 Matt, Very cool - an innovative approach I would not have thought of. Of course we all know that Jochem's a certified genius :) -Mark Mark A. Kruger, MCSE, CFG (402) 408-3733 ext 105 Skype: markakruger www.cfwebtools.com www.coldfusionmuse.com www.necfug.com Here's the update: Jochem's StAX-based solution worked so well -- and it allowed me to re-use existing code since I was already processing xml -- that I didn't bother to consider testing the xml2csv utility. Original code processing a 45mb file yielded an insert time of 90 seconds per record with total file processing time being a projected 66 hours. Using code that incorporated xmlSplitter.cfc, the routine created 2682 discrete xml files in approximately 45 seconds.  From there, the insert loop did its work pulling out xml and pouring it into roughly 100 db table fields at ... 20 records per second. Total time to read in the 45mb xml file, create the discrete files, read them, insert them into the db and delete them (one at a time as I went along in the loop) was 192250ms. A bit more than three minutes. Thats an acceptable level of improvement. I owe you one, Jochem. -- --m@Robertson-- Janitor, The Robertson Team mysecretbase.com
02/19/2011 09:32 PM
Author: Rick Faircloth Short Link: http://www.houseoffusion.com/groups/cf-talk/thread.cfm/threadid:62635#342452 Nice... Here's the update: Jochem's StAX-based solution worked so well -- and it allowed me to re-use existing code since I was already processing xml -- that I didn't bother to consider testing the xml2csv utility. Original code processing a 45mb file yielded an insert time of 90 seconds per record with total file processing time being a projected 66 hours. Using code that incorporated xmlSplitter.cfc, the routine created 2682 discrete xml files in approximately 45 seconds.  From there, the insert loop did its work pulling out xml and pouring it into roughly 100 db table fields at ... 20 records per second. Total time to read in the 45mb xml file, create the discrete files, read them, insert them into the db and delete them (one at a time as I went along in the loop) was 192250ms. A bit more than three minutes. Thats an acceptable level of improvement. I owe you one, Jochem. -- --m@Robertson-- Janitor, The Robertson Team mysecretbase.com
02/19/2011 09:21 PM
Author: Matt Quackenbush Short Link: http://www.houseoffusion.com/groups/cf-talk/thread.cfm/threadid:62635#342451 Nice report!  I'll try and remember that the next time I have to parse huge XML files.  :-)
02/19/2011 09:06 PM
Author: Matt Robertson Short Link: http://www.houseoffusion.com/groups/cf-talk/thread.cfm/threadid:62635#342450 Here's the update: Jochem's StAX-based solution worked so well -- and it allowed me to re-use existing code since I was already processing xml -- that I didn't bother to consider testing the xml2csv utility. Original code processing a 45mb file yielded an insert time of 90 seconds per record with total file processing time being a projected 66 hours. Using code that incorporated xmlSplitter.cfc, the routine created 2682 discrete xml files in approximately 45 seconds.  From there, the insert loop did its work pulling out xml and pouring it into roughly 100 db table fields at ... 20 records per second. Total time to read in the 45mb xml file, create the discrete files, read them, insert them into the db and delete them (one at a time as I went along in the loop) was 192250ms. A bit more than three minutes. Thats an acceptable level of improvement. I owe you one, Jochem. -- --m@Robertson-- Janitor, The Robertson Team mysecretbase.com
02/17/2011 10:51 PM
Author: Matt Robertson Short Link: http://www.houseoffusion.com/groups/cf-talk/thread.cfm/threadid:62635#342417 On Thu, Feb 17, 2011 at 8:56 AM, Rick Faircloth wrote: > > Good to know! Indeed.  Thanks to Marco and Mark.  I plan on trying out both this and Jochem's solution starting maybe as soon as tomorrow and certainly through the long U.S. holiday weekend.  Maybe try both and see which is faster.  I'll post back the results. -- --m@Robertson-- Janitor, The Robertson Team mysecretbase.com
02/17/2011 11:56 AM
Author: Rick Faircloth Short Link: http://www.houseoffusion.com/groups/cf-talk/thread.cfm/threadid:62635#342388 Good to know! In our company I found a solution to parse big XML files in ColdFusion applications: 1) using xml2csv(http://www.a7soft.com/xml2csv.html), a very fast application to convert XML to csv file; 2) with SQL Server BULK INSERT to insert in our database Using this approach reduced time in 100 times. Very fast. Cheers Marco Antonio On Thu, Feb 17, 2011 at 1:30 PM, Rick Faircloth <Rick@whitestonemedia.com>wrote: ----- Excess quoted text cut - see Original Post for more -----
02/17/2011 11:03 AM
Author: Mark A. Kruger Short Link: http://www.houseoffusion.com/groups/cf-talk/thread.cfm/threadid:62635#342383 WE've used this approach as well... seems like a kludge but it actually works really really well. MS Text driver and DTS (or SSIS) are exceptionally fast for importing textual data. We have a process that consumes 350,000 rows of data in about 5-8 seconds... and does it every 2 minutes. -Mark In our company I found a solution to parse big XML files in ColdFusion applications: 1) using xml2csv(http://www.a7soft.com/xml2csv.html), a very fast application to convert XML to csv file; 2) with SQL Server BULK INSERT to insert in our database Using this approach reduced time in 100 times. Very fast. Cheers Marco Antonio On Thu, Feb 17, 2011 at 1:30 PM, Rick Faircloth <Rick@whitestonemedia.com>wrote: ----- Excess quoted text cut - see Original Post for more -----
02/17/2011 10:54 AM
Author: Marco Antonio C. Santos Short Link: http://www.houseoffusion.com/groups/cf-talk/thread.cfm/threadid:62635#342382 In our company I found a solution to parse big XML files in ColdFusion applications: 1) using xml2csv(http://www.a7soft.com/xml2csv.html), a very fast application to convert XML to csv file; 2) with SQL Server BULK INSERT to insert in our database Using this approach reduced time in 100 times. Very fast. Cheers Marco Antonio On Thu, Feb 17, 2011 at 1:30 PM, Rick Faircloth <Rick@whitestonemedia.com>wrote: ----- Excess quoted text cut - see Original Post for more -----
02/17/2011 10:31 AM
Author: Rick Faircloth Short Link: http://www.houseoffusion.com/groups/cf-talk/thread.cfm/threadid:62635#342381 No problem... I haven't done much with XML, myself, so I thought this presented a good opportunity to learn something new, too. Let us know how the solution you're working on pans out. > Perhaps this will be of some help. Thanks Rick,  Interesting stuff.  I don't work with xml so much so a lot of this is new, and I hate new :-).  I haven't been a full-time programmer for a few years now so I haven't kept up with everything like I would have otherwise. I have a business trip today so I have to set this aside but I'm going to run Jochem's solution - which is presented darn near turnkey - as soon as I can tomorrow and see what happens. -- --m@Robertson-- Janitor, The Robertson Team mysecretbase.com
02/17/2011 09:43 AM
Author: Matt Robertson Short Link: http://www.houseoffusion.com/groups/cf-talk/thread.cfm/threadid:62635#342379 > Perhaps this will be of some help. Thanks Rick,  Interesting stuff.  I don't work with xml so much so a lot of this is new, and I hate new :-).  I haven't been a full-time programmer for a few years now so I haven't kept up with everything like I would have otherwise. I have a business trip today so I have to set this aside but I'm going to run Jochem's solution - which is presented darn near turnkey - as soon as I can tomorrow and see what happens. -- --m@Robertson-- Janitor, The Robertson Team mysecretbase.com
02/16/2011 11:09 PM
Author: Rick Faircloth Short Link: http://www.houseoffusion.com/groups/cf-talk/thread.cfm/threadid:62635#342361 Hi, Matt... I've been tinkering with parsing the xml file and, though I haven't developed a full solution, I've made some progress that I thought I'd share. I created a MySQL 5 database called, "xmlTest". I created a MySQL 5 database table called, "xmlBlob". Fields:   - rowID (int, not null, auto increment)   - xmlContent (blob)   - comment (varchar, 200) I then ran this code: <cfset filePath = #expandPath('data.xml')#> <cfoutput>filePath = #filePath#</cfoutput> (checking accuracy) <cfquery name="processXML" datasource="xmlTest">   select    xmlContent   from      xmlBlob   limit      1   into      @xml </cfquery> Data extraction method 1: <cfquery name="extractData01" datasource="xmlTest">   select    extractValue(@xml, '//dataRecord[1]/recordID[1]') recordID </cfquery> <p>cfdump toString(extractData01.recordID) = <cfdump var="#toString(extractData01.recordID)#"></p> Data extraction method 2: <cfquery name="extractData02" datasource="xmlTest">   select    extractValue(b.xmlContent, '//dataRecord/recordID') recordID,         extractValue(b.xmlContent, '//dataRecord/year') year,         extractValue(b.xmlContent, '//dataRecord/make') make,         extractValue(b.xmlContent, '//dataRecord/model') model            from      xmlBlob b    </cfquery> <cfoutput>extractData02.recordCount = #extractData02.recordCount#</cfoutput><br> <br> <cfoutput query="extractData02"> #toString(extractData02.recordID)#<br> #toString(extractData02.year)#<br> #toString(extractData02.make)#<br> #toString(extractData02.model)#<br> <br> </cfoutput> ------------------------------------------------------- When the code above is run in a browser, it outputs: cfdump toString(extractData01.recordID) = 123 extractData02.recordCount = 1 123 124 1980 1981 Trabant Ford TurboXL TurboDL ------------------------------------------------------- I was just to the point of figuring out how to get the two records (I modified your data.xml file example so there would be two records) to show separately, instead of the data being in a list for the fields, when I saw your post and thought I'd go ahead and respond so you could see if this might be an approach that would work for you. In the data extraction method 1 above, the idea is to insert the xml file (which might be too large for this?) into a blob field in a MySQL table, then read the blob field and put the xml data file into a variable, '@xml', then run a cfquery against the variable. Data extraction method 2 skips reading the xml into a variable and queries the blob field directly.  This method uses the MySQL 'xpath_expression' syntax to access data in multi-level xml data files. With xpath, you can go as deeply into the levels as needed. Like I said, this is just the result of tinkering around with what I've been finding and attempting to get some usable output. Perhaps this will be of some help. Rick > Would it help if you split the file into individual records before > processing them? I think it would help immensely I think.  I'd prefer to fight my way thru this in CF if at all possible and things are just complex enough that I may need to go the route of stax.  Outside my comfort zone in terms of installing it on the server but I can likely figure it out. This morning I investigated the possibility of dropping into java and reading one line at a time, parsing out the text in that line... Its what I do in this file's CSV-format predecessor.   But there are no CR's or LF's in the file so that is out. Thanks very much for that post and your thoughts. Incidentally if anyone is looking for an xml reader/editor that can almost instantly read these kinds of moster files, check out the free firstobject xml editor. http://www.firstobject.com/ -- --m@Robertson-- Janitor, The Robertson Team mysecretbase.com
02/16/2011 10:18 PM
Author: Matt Robertson Short Link: http://www.houseoffusion.com/groups/cf-talk/thread.cfm/threadid:62635#342360 > Would it help if you split the file into individual records before > processing them? I think it would help immensely I think.  I'd prefer to fight my way thru this in CF if at all possible and things are just complex enough that I may need to go the route of stax.  Outside my comfort zone in terms of installing it on the server but I can likely figure it out. This morning I investigated the possibility of dropping into java and reading one line at a time, parsing out the text in that line... Its what I do in this file's CSV-format predecessor.   But there are no CR's or LF's in the file so that is out. Thanks very much for that post and your thoughts. Incidentally if anyone is looking for an xml reader/editor that can almost instantly read these kinds of moster files, check out the free firstobject xml editor. http://www.firstobject.com/ -- --m@Robertson-- Janitor, The Robertson Team mysecretbase.com
02/16/2011 06:49 PM
Author: Jochem van Dieten Short Link: http://www.houseoffusion.com/groups/cf-talk/thread.cfm/threadid:62635#342353 > I am tasked with doing an import on a file whose size can essentially > be unlimited.  We've been able to handle in the 10-15mb range but it > recently ballooned to 100 mb, and its going to get larger.  Processing > time seems to be about 66 hours for a 45 mb file and thats a disaster. Would it help if you split the file into individual records before processing them? http://jochem.vandieten.net/2011/02/17/splitting-large-xml-files-with-coldfusion-and-stax/ Jochem -- Jochem van Dieten http://jochem.vandieten.net/
02/16/2011 02:05 PM
Author: Matt Robertson Short Link: http://www.houseoffusion.com/groups/cf-talk/thread.cfm/threadid:62635#342342 Yeah I have seen that, but I'm up against a time constraint and there are a number of issues I don't have my head wrapped around, and the client needs a solution asap given that he has a daily feed encompassing thousands of records and dozens of clients ... that has now been dead about 10 days. I'm sure I could figure it out with enough time, but the client would be better served if someone who actually has done this can hit the ground running on it. BUT with this whole thing in mind, there are some things I don't get about this whole load_file() thing that maybe you or someone else can answer.  I get how the file is pulled into memory, but how can a multi-level file get pulled in? Hopefully this excerpt (and its only a fragment) will explain my problem. - I don't need the header sectioon and want to skip over that. - The dataArea contains the line item records I want to pull in. - I can write CF code to pull the price and currency out of the   pricing section, but how would mySQL do something like that   where the xml design was made as complicated as humanly   possible? (this format is StarStandard.org-based, btw) - Note the ImageAttachmentExtended records.  There can   be any number of those, or none.  A CFloop works great but ... - the VehicleRemarketingBoat section can have any number   of line items *or groups* and they vary from record to record.   The groups can have more than one entry in them, or not.   Here again CF gives me tools to deal with this. I just don't see how something like this can be dumped into a db at all.  I'm assuming its inexperience with the task.  Or is this xml a bit beyond that type of approach? <headerSection>     <headerField1>blah</headerField1>     <headerField2>blah</headerField2>     <headerField3>yoo</headerField3>     <headerField4>hoo</headerField4>     <headerField5>woof</headerField5> </headerSection> <dataArea>     <dataRecord>         <recordID>123</recordID>         <year>1980</year>         <make>Trabant</make>         <model>TurboXL</model>         <pricing>             <hidePrice>false</hidePrice>             <hidePrice>8</hidePrice>             <Price>                 <ChargeAmount currencyID='USD'>99500</ChargeAmount>             </Price>         </pricing>         <Location>             <LocationAddress>                 <CityName>Annapolis</CityName>                 <CountryID>US</CountryID> <StateOrProvinceCountrySubDivisionID>MD</StateOrProvinceCountrySubDivisionID>             </LocationAddress>         </Location>         <ImageAttachmentExtended>             <ID>46795014</ID>             <URI>http://img.arf.com/images/1/70/33/3197033_0_080220111124_1.jpg</URI>;             <ImageWidthMeasure unitCode='pixel'>412</ImageWidthMeasure>             <ImageHeightMeasure unitCode='pixel'>471</ImageHeightMeasure>              <ImageLastModifiedDateTime>2011-02-08T11:24:00-08:00</ImageLastModifiedDateTime>             <UsagePreference>                 <PriorityRankingNumeric>1</PriorityRankingNumeric>             </UsagePreference>         </ImageAttachmentExtended>         <ImageAttachmentExtended>             <ID>123456</ID>             <URI>http://img.arf.com/images/1/70/33/3197033_0_080220111124_2.jpg</URI>;             <ImageWidthMeasure unitCode='pixel'>412</ImageWidthMeasure>             <ImageHeightMeasure unitCode='pixel'>471</ImageHeightMeasure>              <ImageLastModifiedDateTime>2011-02-08T11:24:00-08:00</ImageLastModifiedDateTime>             <UsagePreference>                 <PriorityRankingNumeric>2</PriorityRankingNumeric>             </UsagePreference>         </ImageAttachmentExtended>         <ImageAttachmentExtended>             <ID>987654</ID>             <URI>http://img.arf.com/images/1/70/33/3197033_0_080220111124_3.jpg</URI>;             <ImageWidthMeasure unitCode='pixel'>412</ImageWidthMeasure>             <ImageHeightMeasure unitCode='pixel'>471</ImageHeightMeasure>              <ImageLastModifiedDateTime>2011-02-08T11:24:00-08:00</ImageLastModifiedDateTime>             <UsagePreference>                 <PriorityRankingNumeric>1</PriorityRankingNumeric>             </UsagePreference>         </ImageAttachmentExtended>         <VehicleRemarketingBoat>             <MakeString>Starratt and Jenks</MakeString>             <ModelYear>1979</ModelYear>             <SaleClassCode>Used</SaleClassCode>             <Model>Morgan 45</Model>             <BoatLengthGroup>                 <BoatLengthCode>Nominal Length</BoatLengthCode>                 <BoatLengthMeasure unitCode='feet'>45</BoatLengthMeasure>             </BoatLengthGroup>             <BoatLengthGroup>                 <BoatLengthCode>Length At Water Line</BoatLengthCode>                 <BoatLengthMeasure unitCode='feet'>31.42</BoatLengthMeasure>             </BoatLengthGroup>             <BoatLengthGroup>                 <BoatLengthCode>Length Overall</BoatLengthCode>                 <BoatLengthMeasure unitCode='feet'>45</BoatLengthMeasure>             </BoatLengthGroup>             <BeamMeasure unitCode='feet'>11</BeamMeasure>             <DraftMeasureGroup>                 <DraftMeasure unitCode='feet'>6.08</DraftMeasure>                 <BoatDraftCode>Max Draft</BoatDraftCode>             </DraftMeasureGroup>             <DisplacementMeasure unitCode='pound'>25000</DisplacementMeasure>             <BoatCategoryCode>Sail</BoatCategoryCode>             <BoatClassGroup>                 <BoatClassCode>Cruisers</BoatClassCode>                  <PrimaryBoatClassIndicator>true</PrimaryBoatClassIndicator>             </BoatClassGroup>             <GeneralBoatDescription>One Big Boat</GeneralBoatDescription>             <BuilderName>Starratt and Jenks</BuilderName>             <DesignerName>Charles Morgan</DesignerName>             <BoatName>Inventions</BoatName>             <Hull>                  <BoatHullMaterialCode>Fiberglass</BoatHullMaterialCode>             </Hull>             <BallastWeightMeasure unitCode='pound'>12500</BallastWeightMeasure>         </VehicleRemarketingBoat>     </dataRecord> </dataArea> ----- Excess quoted text cut - see Original Post for more -----
02/16/2011 02:05 PM
Author: Russ Michaels Short Link: http://www.houseoffusion.com/groups/cf-talk/thread.cfm/threadid:62635#342343 Matt, as you want a non CF solution, you might be best posting elsewhere to find a database guru who could do it all directly on the DB. If no-one here pipes up then try www.odesk.com On Wed, Feb 16, 2011 at 6:09 PM, Rick Faircloth <Rick@whitestonemedia.com>wrote: ----- Excess quoted text cut - see Original Post for more -----
02/16/2011 01:10 PM
Author: Rick Faircloth Short Link: http://www.houseoffusion.com/groups/cf-talk/thread.cfm/threadid:62635#342338 Hi, Matt... I know you might have looked at this link at dev.mysql.com for a solution to processing your data feed, but I thought I'd bring it up, just in case you haven't seen it.  Looks like it might be right up your alley. http://dev.mysql.com/tech-resources/articles/xml-in-mysql5.1-6.0.html#xml-5. 1-in-and-out While I didn't have to process xml files, I do have to process a daily data feed of real estate data in comma-delimited files and insert that info into a MySQL 5 database. At first, I tried processing the files with CF, but it was a *really* slow process using CF, taking almost a minute to process some files.  Finally, someone on this list suggested I take a look at the MySQL commands for loading data from files, such as, in your case, "Load_File()", which opens an entire XML document, makes it available as a string, and inserts this string into a table column.  In my case, using MySQL's load data infile, the time was reduced to less than a second. Check out the page above for details on usage and syntax. hth, Rick Hey everybody, we're in a bind here and I pitched the idea to the client of bringing in a ringer.  Is there someone out there who is interested in building a direct-to-SQL Server or direct-to-mySQL-based solution? Solution would have to, on a scheduled basis, grab a file in a location on the server on its own without CF and process it (we have stuff that retrieves/places the file).  Current file size is about 50 mb but must be expected to go to around 100.  At 42mb using just CF I am at about 70 seconds per record loop.  We need a *significant* improvement in performance.  Work would be on a dedicated Windows server.  Present box is a fairly powerful Crystaltech Win2003 Server w/4GB RAM and a Xeon 2.8 processor w/4 cores.  We'd consider a Win2008, 64-bit installation to improve speed.  Looking at a 24-core server at Viviotech w/16gb as a hi-horsepower alternative. Our feed partner is probably going to be able to use a different method of delivery that will reduce file size, but for now we need to plan for the worst and move on it. Anyone interested?  You can email me at my for-reals email at matt AT mysecretbase DOT com. -- --m@Robertson-- Janitor, The Robertson Team mysecretbase.com
02/16/2011 12:22 PM
Author: Matt Robertson Short Link: http://www.houseoffusion.com/groups/cf-talk/thread.cfm/threadid:62635#342336 Hey everybody, we're in a bind here and I pitched the idea to the client of bringing in a ringer.  Is there someone out there who is interested in building a direct-to-SQL Server or direct-to-mySQL-based solution? Solution would have to, on a scheduled basis, grab a file in a location on the server on its own without CF and process it (we have stuff that retrieves/places the file).  Current file size is about 50 mb but must be expected to go to around 100.  At 42mb using just CF I am at about 70 seconds per record loop.  We need a *significant* improvement in performance.  Work would be on a dedicated Windows server.  Present box is a fairly powerful Crystaltech Win2003 Server w/4GB RAM and a Xeon 2.8 processor w/4 cores.  We'd consider a Win2008, 64-bit installation to improve speed.  Looking at a 24-core server at Viviotech w/16gb as a hi-horsepower alternative. Our feed partner is probably going to be able to use a different method of delivery that will reduce file size, but for now we need to plan for the worst and move on it. Anyone interested?  You can email me at my for-reals email at matt AT mysecretbase DOT com. -- --m@Robertson-- Janitor, The Robertson Team mysecretbase.com
02/14/2011 07:52 PM
Author: Matt Robertson Short Link: http://www.houseoffusion.com/groups/cf-talk/thread.cfm/threadid:62635#342231 Very interesting stuff, guys.  Very much appreciated.  Should have come here before I did the project as opposed to afterward.  You'd think I'd know better by now. -- --m@Robertson-- Janitor, The Robertson Team mysecretbase.com
02/14/2011 07:47 PM
Author: Russ Michaels Short Link: http://www.houseoffusion.com/groups/cf-talk/thread.cfm/threadid:62635#342229 That is probably the solution I was thinking of too, I expect Ben or someone just wrapped it up into a UDF or CFC or posted it on one of the aforementioned sites. Russ ----- Excess quoted text cut - see Original Post for more -----
02/14/2011 07:33 PM
Author: Michael Grant Short Link: http://www.houseoffusion.com/groups/cf-talk/thread.cfm/threadid:62635#342227 That actually looks like a pretty nice solution if bulkinsert isn't an option. ----- Excess quoted text cut - see Original Post for more -----
02/14/2011 07:30 PM
Author: Russ Michaels Short Link: http://www.houseoffusion.com/groups/cf-talk/thread.cfm/threadid:62635#342226 Processing the XML with Java rather than CFML will speed things up. check riaforge.org or cflib.org and the adobe exchange, somewhere there is a XML parser that does just this and may give you the boost you want. Russ On Tue, Feb 15, 2011 at 12:15 AM, Casey Dougall < casey@uberwebsitesolutions.com> wrote: ----- Excess quoted text cut - see Original Post for more -----
02/14/2011 07:26 PM
Author: Alan Rother Short Link: http://www.houseoffusion.com/groups/cf-talk/thread.cfm/threadid:62635#342225 When in doubt, ask Ben Nadel(or check his blog)... Words to live by I had the same issue, huge variable XML file, lots of variations, had a hell of a time making it work in SQL Server due to the complexity Used this, works amazingly well, all things considered, executes very quickly http://www.bennadel.com/blog/1345-Ask-Ben-Parsing-Very-Large-XML-Documents-In-ColdFusion.htm =] On Mon, Feb 14, 2011 at 5:15 PM, Casey Dougall < casey@uberwebsitesolutions.com> wrote: ----- Excess quoted text cut - see Original Post for more -----
02/14/2011 07:16 PM
Author: Casey Dougall Short Link: http://www.houseoffusion.com/groups/cf-talk/thread.cfm/threadid:62635#342223 On Mon, Feb 14, 2011 at 7:07 PM, Matt Robertson <websitemaker@gmail.com>wrote: > > Client is running mySQL, But a SQL Server could probably be arranged > if we have to.  Kind of a painful investment for them, though. > > Well seems there is support for this in mysql as well but examples use something simular to BulkInsert where the xml file is on the MySql Server. http://dev.mysql.com/tech-resources/articles/xml-in-mysql5.1-6.0.html
02/14/2011 07:09 PM
Author: Matt Robertson Short Link: http://www.houseoffusion.com/groups/cf-talk/thread.cfm/threadid:62635#342221 > Either that, or start looking at something like StaX - > http://stax.codehaus.org/Home thx for the tip on Stax but it frankly looks a bit out of my league for this project. -- --m@Robertson-- Janitor, The Robertson Team mysecretbase.com
02/14/2011 07:09 PM
Author: Michael Grant Short Link: http://www.houseoffusion.com/groups/cf-talk/thread.cfm/threadid:62635#342222 > > you can make use of BulkInsert +1. I love love love bulkinsert.
02/14/2011 07:07 PM
Author: Matt Robertson Short Link: http://www.houseoffusion.com/groups/cf-talk/thread.cfm/threadid:62635#342220 Client is running mySQL, But a SQL Server could probably be arranged if we have to.  Kind of a painful investment for them, though. I was kind of hoping you guys would find a bonehead mistake that would magically clear this up.  I'm out of CF-based ideas. Speaking of which, I tried shortening those var names with some replace() statements and it had zero effect.  Didn't expect it would but wanted to throw it up against the wall. Damn shame I can't drop to java and do a line-by-line read like I do with CSV files :-( -- --m@Robertson-- Janitor, The Robertson Team mysecretbase.com
02/14/2011 07:03 PM
Author: Casey Dougall Short Link: http://www.houseoffusion.com/groups/cf-talk/thread.cfm/threadid:62635#342218 ----- Excess quoted text cut - see Original Post for more ----- yeah, I don't know, still seems better equipped for SQL server if you can do it there. <cfstoredproc procedure="spBIG_XML_Insert" datasource="#request.DataSource#" returncode="Yes">   <cfprocparam type="In" cfsqltype="CF_SQL_VARCHAR" variable="mydoc" value="#r.SXMLRESPONSE#"> </cfstoredproc> And then the trimmed down Stored Procedure.... ALTER PROCEDURE [dbo].[spBIG_XML_Insert] @mydoc     xml, AS declare @hdoc int -- Create an internal representation of the XML document. EXEC sp_xml_preparedocument @hdoc OUTPUT, @mydoc, '<soap:Envelope     xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"     xmlns:xsd="http://www.w3.org/2001/XMLSchema"     xmlns:soap="http://schemas.xmlsoap.org/soap/envelope/"     xmlns:c="urn:Scores/Services/v2/ScoresUpdate" />' -- Now do the insert INSERT INTO tblScoreUpdates SELECT     MemberID                 = x.MemberID,     ReportDate                = x.ReportDate,     Score                       = x.Score,     DateAdded               = GetDate() FROM OPENXML (@hdoc, 'soap:Envelope/soap:Body/c:LatestScoresResponse/c:parameters/c:Scores',2) WITH     (   MemberID     varchar(10)   '@MemberID',         ReportDate     varchar(25)  '@ReportDate',         Score            int               '@Score') as x WHERE         x.MemberID NOT IN (SELECT t.MemberID                                          FROM  tblScoreUpdates t With (NoLock)                                          WHERE t.MemberID        = x.MemberID) ; --Do another query here if you want, --remove the document from memory EXEC sp_xml_removedocument @hDoc And if you needed something from a level up... just dot notation back to it. LogID = int  '../@LogID which would grab the logID from Parameters...
02/14/2011 06:27 PM
Author: Mark Mandel Short Link: http://www.houseoffusion.com/groups/cf-talk/thread.cfm/threadid:62635#342216 Either that, or start looking at something like StaX - http://stax.codehaus.org/Home Pulling that large a file in memory is going to suck a lot, no matter what you do. Mark ----- Excess quoted text cut - see Original Post for more -----
02/14/2011 06:25 PM
Author: Casey Dougall Short Link: http://www.houseoffusion.com/groups/cf-talk/thread.cfm/threadid:62635#342215 On Mon, Feb 14, 2011 at 5:44 PM, Matt Robertson <websitemaker@gmail.com>wrote: ----- Excess quoted text cut - see Original Post for more ----- Are you using MS SQL 2005? Because throwing the XML file directly at SQL server works a 1000% better. I've basically given up on processing large XML files directly with coldfusion. IF you can save these xml files on the SQL server you can make use of BulkInsert but that needs to read the file from that server. OR just setup a stored procedure, send your xml over and let it do it's thing. http://msdn.microsoft.com/en-us/library/ms345117%28v=SQL.90%29.aspx
02/14/2011 05:45 PM
Author: Matt Robertson Short Link: http://www.houseoffusion.com/groups/cf-talk/thread.cfm/threadid:62635#342213 My question is how do I reduce processing time... is there anything I could be doing better? I am tasked with doing an import on a file whose size can essentially be unlimited.  We've been able to handle in the 10-15mb range but it recently ballooned to 100 mb, and its going to get larger.  Processing time seems to be about 66 hours for a 45 mb file and thats a disaster. For a 14mb file its about 90 minutes. Whats happening is this:  CF is looping over a very large number of records that are in themselves fairly complex.  The more records there are, the longer the loop over them takes.  More records + more time per record = a seemingly geometric increase in processing time. The file is read in like so. <!--- read the imported file ---> <cflock     name="mates_import"     type="EXCLUSIVE"     timeout="10">     <cffile         action="READ"         file="#variables.mates.srcFile#"         variable="x"> </cflock> <cfset x = replaceNoCase(x,"StateOrProvinceCountrySub-DivisionID","StateProvince","ALL")> <!--- turn the file into a coldfusion xml object ---> <cfset x=ltrim(trim(x))> <cfset x=XMLParse(x)> The above takes only a few seconds.  No problem there. Next I have to read in some header info <cfscript>: header.logicalID=x.processvehicleremarketing.applicationarea.sender.logicalID.xmltext; header.taskID=x.processvehicleremarketing.applicationarea.sender.taskID.xmltext; header.BODID=x.processvehicleremarketing.applicationarea.BODID.xmltext; header.created=x.processvehicleremarketing.applicationarea.CreationDateTime.xmltext; // ... // and here comes the node with all of the line items in it I'll have to loop over.  This is where all of the speed issues have been traced to: variables.mates.boatArrayLen=arrayLen(x.processvehicleremarketing.ProcessVehicleRemarketingDataArea.VehicleRemarketing); </cfscript> knowing the array length I can use CFLOOP to look over it and pull data in where it is then stored in a db. <cfloop     from="1"     to="#variables.mates.boatArrayLen#"     index="i">     <cfscript>     listings_mates.inHouseListingNumber="M-" & x.processVehicleRemarketing.processVehicleRemarketingDataArea.vehicleRemarketing[i].vehicleRemarketingHeader.documentIdentificationGroup.documentIdentification.documentID.xmltext;      listings_mates.price=x.processVehicleRemarketing.processVehicleRemarketingDataArea.vehicleRemarketing[i].vehicleRemarketingBoatLineItem.pricingABIE.price.chargeAmount.xmltext;      listings_mates.currency=x.processVehicleRemarketing.processVehicleRemarketingDataArea.vehicleRemarketing[i].vehicleRemarketingBoatLineItem.pricingABIE.price.chargeAmount.xmlAttributes.currencyID;     // there can be more than one of these items so run a loop inside of the loop      variables.mates.engineArrayLen=arrayLen(x.processVehicleRemarketing.processVehicleRemarketingDataArea.vehicleRemarketing[i].vehicleRemarketingBoatLineItem.VehicleRemarketingEngineLineItem);     ii=0;     do {         ii=ii+1;          listings_mates.engineDesc=x.processVehicleRemarketing.processVehicleRemarketingDataArea.vehicleRemarketing[i].vehicleRemarketingBoatLineItem.VehicleRemarketingEngineLineItem[ii].VehicleRemarketingEngine.modelDescription.xmltext;          listings_mates.engHrs=x.processVehicleRemarketing.processVehicleRemarketingDataArea.vehicleRemarketing[i].vehicleRemarketingBoatLineItem.VehicleRemarketingEngineLineItem[ii].VehicleRemarketingEngine.totalEngineHoursNumeric.xmltext;     } while (ii LTE variables.mates.engineArrayLen);     ...     </cfscript> ... </cfloop> And so on.  A hundred or so fields and a dozen or so loops inside the main loop, along with a loop or two inside of those.  So the very long variable names get even longer. As you can see I am pouring the data into a struct, and when done, I insert it as a db record. Anyone see a mistake in my methods?  Would things speed up if, before I read it into an xml object, I ran a replace() or three to shorten up some of those names? -- --m@Robertson-- Janitor, The Robertson Team mysecretbase.com
<< Previous Thread Today's Threads Next Thread >>

Search cf-talk

May 24, 2012

<<   <   Today   >   >>
Su Mo Tu We Th Fr Sa
     1 2 3 4 5
6 7 8 9 10 11 12
13 14 15 16 17 18 19
20 21 22 23 24 25 26
27 28 29 30 31     

Designer, Developer and mobile workflow conference