|
Mailing Lists
|
Home /
Groups /
ColdFusion Talk (CF-Talk)
Big XML files processing Really s-l-o-w Solution?
Author: Mark A. Kruger
Short Link: http://www.houseoffusion.com/groups/cf-talk/thread.cfm/threadid:62635#342453
Matt,
Very cool - an innovative approach I would not have thought of. Of course we
all know that Jochem's a certified genius :)
-Mark
Mark A. Kruger, MCSE, CFG
(402) 408-3733 ext 105
Skype: markakruger
www.cfwebtools.com
www.coldfusionmuse.com
www.necfug.com
Here's the update:
Jochem's StAX-based solution worked so well -- and it allowed me to
re-use existing code since I was already processing xml -- that I
didn't bother to consider testing the xml2csv utility.
Original code processing a 45mb file yielded an insert time of 90
seconds per record with total file processing time being a projected
66 hours.
Using code that incorporated xmlSplitter.cfc, the routine created 2682
discrete xml files in approximately 45 seconds. From there, the
insert loop did its work pulling out xml and pouring it into roughly
100 db table fields at ... 20 records per second.
Total time to read in the 45mb xml file, create the discrete files,
read them, insert them into the db and delete them (one at a time as I
went along in the loop) was 192250ms.
A bit more than three minutes.
Thats an acceptable level of improvement.
I owe you one, Jochem.
--
--m@Robertson--
Janitor, The Robertson Team
mysecretbase.com
Author: Rick Faircloth
Short Link: http://www.houseoffusion.com/groups/cf-talk/thread.cfm/threadid:62635#342452
Nice...
Here's the update:
Jochem's StAX-based solution worked so well -- and it allowed me to
re-use existing code since I was already processing xml -- that I
didn't bother to consider testing the xml2csv utility.
Original code processing a 45mb file yielded an insert time of 90
seconds per record with total file processing time being a projected
66 hours.
Using code that incorporated xmlSplitter.cfc, the routine created 2682
discrete xml files in approximately 45 seconds. From there, the
insert loop did its work pulling out xml and pouring it into roughly
100 db table fields at ... 20 records per second.
Total time to read in the 45mb xml file, create the discrete files,
read them, insert them into the db and delete them (one at a time as I
went along in the loop) was 192250ms.
A bit more than three minutes.
Thats an acceptable level of improvement.
I owe you one, Jochem.
--
--m@Robertson--
Janitor, The Robertson Team
mysecretbase.com
Author: Matt Quackenbush
Short Link: http://www.houseoffusion.com/groups/cf-talk/thread.cfm/threadid:62635#342451
Nice report! I'll try and remember that the next time I have to parse huge
XML files. :-)
Author: Matt Robertson
Short Link: http://www.houseoffusion.com/groups/cf-talk/thread.cfm/threadid:62635#342450
Here's the update:
Jochem's StAX-based solution worked so well -- and it allowed me to
re-use existing code since I was already processing xml -- that I
didn't bother to consider testing the xml2csv utility.
Original code processing a 45mb file yielded an insert time of 90
seconds per record with total file processing time being a projected
66 hours.
Using code that incorporated xmlSplitter.cfc, the routine created 2682
discrete xml files in approximately 45 seconds. From there, the
insert loop did its work pulling out xml and pouring it into roughly
100 db table fields at ... 20 records per second.
Total time to read in the 45mb xml file, create the discrete files,
read them, insert them into the db and delete them (one at a time as I
went along in the loop) was 192250ms.
A bit more than three minutes.
Thats an acceptable level of improvement.
I owe you one, Jochem.
--
--m@Robertson--
Janitor, The Robertson Team
mysecretbase.com
Author: Matt Robertson
Short Link: http://www.houseoffusion.com/groups/cf-talk/thread.cfm/threadid:62635#342417
On Thu, Feb 17, 2011 at 8:56 AM, Rick Faircloth wrote:
>
> Good to know!
Indeed. Thanks to Marco and Mark. I plan on trying out both this and
Jochem's solution starting maybe as soon as tomorrow and certainly
through the long U.S. holiday weekend. Maybe try both and see which
is faster. I'll post back the results.
--
--m@Robertson--
Janitor, The Robertson Team
mysecretbase.com
Author: Rick Faircloth
Short Link: http://www.houseoffusion.com/groups/cf-talk/thread.cfm/threadid:62635#342388
Good to know!
In our company I found a solution to parse big XML files in ColdFusion
applications:
1) using xml2csv(http://www.a7soft.com/xml2csv.html), a very fast
application to convert XML to csv file;
2) with SQL Server BULK INSERT to insert in our database
Using this approach reduced time in 100 times. Very fast.
Cheers
Marco Antonio
On Thu, Feb 17, 2011 at 1:30 PM, Rick Faircloth
<Rick@whitestonemedia.com>wrote:
----- Excess quoted text cut - see Original Post for more -----
Author: Mark A. Kruger
Short Link: http://www.houseoffusion.com/groups/cf-talk/thread.cfm/threadid:62635#342383
WE've used this approach as well... seems like a kludge but it actually
works really really well. MS Text driver and DTS (or SSIS) are exceptionally
fast for importing textual data. We have a process that consumes 350,000
rows of data in about 5-8 seconds... and does it every 2 minutes.
-Mark
In our company I found a solution to parse big XML files in ColdFusion
applications:
1) using xml2csv(http://www.a7soft.com/xml2csv.html), a very fast
application to convert XML to csv file;
2) with SQL Server BULK INSERT to insert in our database
Using this approach reduced time in 100 times. Very fast.
Cheers
Marco Antonio
On Thu, Feb 17, 2011 at 1:30 PM, Rick Faircloth
<Rick@whitestonemedia.com>wrote:
----- Excess quoted text cut - see Original Post for more -----
Author: Marco Antonio C. Santos
Short Link: http://www.houseoffusion.com/groups/cf-talk/thread.cfm/threadid:62635#342382
In our company I found a solution to parse big XML files in ColdFusion
applications:
1) using xml2csv(http://www.a7soft.com/xml2csv.html), a very fast
application to convert XML to csv file;
2) with SQL Server BULK INSERT to insert in our database
Using this approach reduced time in 100 times. Very fast.
Cheers
Marco Antonio
On Thu, Feb 17, 2011 at 1:30 PM, Rick Faircloth
<Rick@whitestonemedia.com>wrote:
----- Excess quoted text cut - see Original Post for more -----
Author: Rick Faircloth
Short Link: http://www.houseoffusion.com/groups/cf-talk/thread.cfm/threadid:62635#342381
No problem... I haven't done much with XML, myself, so I thought
this presented a good opportunity to learn something new, too.
Let us know how the solution you're working on pans out.
> Perhaps this will be of some help.
Thanks Rick, Interesting stuff. I don't work with xml so much so a
lot of this is new, and I hate new :-). I haven't been a full-time
programmer for a few years now so I haven't kept up with everything
like I would have otherwise.
I have a business trip today so I have to set this aside but I'm going
to run Jochem's solution - which is presented darn near turnkey - as
soon as I can tomorrow and see what happens.
--
--m@Robertson--
Janitor, The Robertson Team
mysecretbase.com
Author: Matt Robertson
Short Link: http://www.houseoffusion.com/groups/cf-talk/thread.cfm/threadid:62635#342379
> Perhaps this will be of some help.
Thanks Rick, Interesting stuff. I don't work with xml so much so a
lot of this is new, and I hate new :-). I haven't been a full-time
programmer for a few years now so I haven't kept up with everything
like I would have otherwise.
I have a business trip today so I have to set this aside but I'm going
to run Jochem's solution - which is presented darn near turnkey - as
soon as I can tomorrow and see what happens.
--
--m@Robertson--
Janitor, The Robertson Team
mysecretbase.com
Author: Rick Faircloth
Short Link: http://www.houseoffusion.com/groups/cf-talk/thread.cfm/threadid:62635#342361
Hi, Matt...
I've been tinkering with parsing the xml file and, though
I haven't developed a full solution, I've made some progress
that I thought I'd share.
I created a MySQL 5 database called, "xmlTest".
I created a MySQL 5 database table called, "xmlBlob".
Fields:
- rowID (int, not null, auto increment)
- xmlContent (blob)
- comment (varchar, 200)
I then ran this code:
<cfset filePath = #expandPath('data.xml')#>
<cfoutput>filePath = #filePath#</cfoutput> (checking accuracy)
<cfquery name="processXML" datasource="xmlTest">
select xmlContent
from xmlBlob
limit 1
into @xml
</cfquery>
Data extraction method 1:
<cfquery name="extractData01" datasource="xmlTest">
select extractValue(@xml, '//dataRecord[1]/recordID[1]')
recordID
</cfquery>
<p>cfdump toString(extractData01.recordID) = <cfdump
var="#toString(extractData01.recordID)#"></p>
Data extraction method 2:
<cfquery name="extractData02" datasource="xmlTest">
select extractValue(b.xmlContent, '//dataRecord/recordID')
recordID,
extractValue(b.xmlContent,
'//dataRecord/year') year,
extractValue(b.xmlContent,
'//dataRecord/make') make,
extractValue(b.xmlContent,
'//dataRecord/model') model
from xmlBlob b
</cfquery>
<cfoutput>extractData02.recordCount =
#extractData02.recordCount#</cfoutput><br>
<br>
<cfoutput query="extractData02">
#toString(extractData02.recordID)#<br>
#toString(extractData02.year)#<br>
#toString(extractData02.make)#<br>
#toString(extractData02.model)#<br>
<br>
</cfoutput>
-------------------------------------------------------
When the code above is run in a browser, it outputs:
cfdump toString(extractData01.recordID) = 123
extractData02.recordCount = 1
123 124
1980 1981
Trabant Ford
TurboXL TurboDL
-------------------------------------------------------
I was just to the point of figuring out how to get
the two records (I modified your data.xml file example so there
would be two records) to show separately, instead of
the data being in a list for the fields, when I saw your
post and thought I'd go ahead and respond so you could see
if this might be an approach that would work for you.
In the data extraction method 1 above, the idea is to insert
the xml file (which might be too large for this?)
into a blob field in a MySQL table, then read
the blob field and put the xml data file into a variable, '@xml',
then run a cfquery against the variable.
Data extraction method 2 skips reading the xml into a variable
and queries the blob field directly. This method uses the
MySQL 'xpath_expression' syntax to access data in multi-level
xml data files. With xpath, you can go as deeply into the levels
as needed.
Like I said, this is just the result of tinkering around with
what I've been finding and attempting to get some usable output.
Perhaps this will be of some help.
Rick
> Would it help if you split the file into individual records before
> processing them?
I think it would help immensely I think. I'd prefer to fight my way
thru this in CF if at all possible and things are just complex enough
that I may need to go the route of stax. Outside my comfort zone in
terms of installing it on the server but I can likely figure it out.
This morning I investigated the possibility of dropping into java and
reading one line at a time, parsing out the text in that line... Its
what I do in this file's CSV-format predecessor. But there are no
CR's or LF's in the file so that is out.
Thanks very much for that post and your thoughts.
Incidentally if anyone is looking for an xml reader/editor that can
almost instantly read these kinds of moster files, check out the free
firstobject xml editor.
http://www.firstobject.com/
--
--m@Robertson--
Janitor, The Robertson Team
mysecretbase.com
Author: Matt Robertson
Short Link: http://www.houseoffusion.com/groups/cf-talk/thread.cfm/threadid:62635#342360
> Would it help if you split the file into individual records before
> processing them?
I think it would help immensely I think. I'd prefer to fight my way
thru this in CF if at all possible and things are just complex enough
that I may need to go the route of stax. Outside my comfort zone in
terms of installing it on the server but I can likely figure it out.
This morning I investigated the possibility of dropping into java and
reading one line at a time, parsing out the text in that line... Its
what I do in this file's CSV-format predecessor. But there are no
CR's or LF's in the file so that is out.
Thanks very much for that post and your thoughts.
Incidentally if anyone is looking for an xml reader/editor that can
almost instantly read these kinds of moster files, check out the free
firstobject xml editor.
http://www.firstobject.com/
--
--m@Robertson--
Janitor, The Robertson Team
mysecretbase.com
Author: Jochem van Dieten
Short Link: http://www.houseoffusion.com/groups/cf-talk/thread.cfm/threadid:62635#342353
> I am tasked with doing an import on a file whose size can essentially
> be unlimited. We've been able to handle in the 10-15mb range but it
> recently ballooned to 100 mb, and its going to get larger. Processing
> time seems to be about 66 hours for a 45 mb file and thats a disaster.
Would it help if you split the file into individual records before
processing them?
http://jochem.vandieten.net/2011/02/17/splitting-large-xml-files-with-coldfusion-and-stax/
Jochem
--
Jochem van Dieten
http://jochem.vandieten.net/
Author: Matt Robertson
Short Link: http://www.houseoffusion.com/groups/cf-talk/thread.cfm/threadid:62635#342342
Yeah I have seen that, but I'm up against a time constraint and there
are a number of issues I don't have my head wrapped around, and the
client needs a solution asap given that he has a daily feed
encompassing thousands of records and dozens of clients ... that has
now been dead about 10 days.
I'm sure I could figure it out with enough time, but the client would
be better served if someone who actually has done this can hit the
ground running on it.
BUT with this whole thing in mind, there are some things I don't get
about this whole load_file() thing that maybe you or someone else can
answer. I get how the file is pulled into memory, but how can a
multi-level file get pulled in?
Hopefully this excerpt (and its only a fragment) will explain my problem.
- I don't need the header sectioon and want to skip over that.
- The dataArea contains the line item records I want to pull in.
- I can write CF code to pull the price and currency out of the
pricing section, but how would mySQL do something like that
where the xml design was made as complicated as humanly
possible? (this format is StarStandard.org-based, btw)
- Note the ImageAttachmentExtended records. There can
be any number of those, or none. A CFloop works great but ...
- the VehicleRemarketingBoat section can have any number
of line items *or groups* and they vary from record to record.
The groups can have more than one entry in them, or not.
Here again CF gives me tools to deal with this.
I just don't see how something like this can be dumped into a db at
all. I'm assuming its inexperience with the task. Or is this xml a
bit beyond that type of approach?
<headerSection>
<headerField1>blah</headerField1>
<headerField2>blah</headerField2>
<headerField3>yoo</headerField3>
<headerField4>hoo</headerField4>
<headerField5>woof</headerField5>
</headerSection>
<dataArea>
<dataRecord>
<recordID>123</recordID>
<year>1980</year>
<make>Trabant</make>
<model>TurboXL</model>
<pricing>
<hidePrice>false</hidePrice>
<hidePrice>8</hidePrice>
<Price>
<ChargeAmount currencyID='USD'>99500</ChargeAmount>
</Price>
</pricing>
<Location>
<LocationAddress>
<CityName>Annapolis</CityName>
<CountryID>US</CountryID>
<StateOrProvinceCountrySubDivisionID>MD</StateOrProvinceCountrySubDivisionID>
</LocationAddress>
</Location>
<ImageAttachmentExtended>
<ID>46795014</ID>
<URI>http://img.arf.com/images/1/70/33/3197033_0_080220111124_1.jpg</URI>
<ImageWidthMeasure
unitCode='pixel'>412</ImageWidthMeasure>
<ImageHeightMeasure
unitCode='pixel'>471</ImageHeightMeasure>
<ImageLastModifiedDateTime>2011-02-08T11:24:00-08:00</ImageLastModifiedDateTime>
<UsagePreference>
<PriorityRankingNumeric>1</PriorityRankingNumeric>
</UsagePreference>
</ImageAttachmentExtended>
<ImageAttachmentExtended>
<ID>123456</ID>
<URI>http://img.arf.com/images/1/70/33/3197033_0_080220111124_2.jpg</URI>
<ImageWidthMeasure
unitCode='pixel'>412</ImageWidthMeasure>
<ImageHeightMeasure
unitCode='pixel'>471</ImageHeightMeasure>
<ImageLastModifiedDateTime>2011-02-08T11:24:00-08:00</ImageLastModifiedDateTime>
<UsagePreference>
<PriorityRankingNumeric>2</PriorityRankingNumeric>
</UsagePreference>
</ImageAttachmentExtended>
<ImageAttachmentExtended>
<ID>987654</ID>
<URI>http://img.arf.com/images/1/70/33/3197033_0_080220111124_3.jpg</URI>
<ImageWidthMeasure
unitCode='pixel'>412</ImageWidthMeasure>
<ImageHeightMeasure
unitCode='pixel'>471</ImageHeightMeasure>
<ImageLastModifiedDateTime>2011-02-08T11:24:00-08:00</ImageLastModifiedDateTime>
<UsagePreference>
<PriorityRankingNumeric>1</PriorityRankingNumeric>
</UsagePreference>
</ImageAttachmentExtended>
<VehicleRemarketingBoat>
<MakeString>Starratt and Jenks</MakeString>
<ModelYear>1979</ModelYear>
<SaleClassCode>Used</SaleClassCode>
<Model>Morgan 45</Model>
<BoatLengthGroup>
<BoatLengthCode>Nominal Length</BoatLengthCode>
<BoatLengthMeasure
unitCode='feet'>45</BoatLengthMeasure>
</BoatLengthGroup>
<BoatLengthGroup>
<BoatLengthCode>Length At Water Line</BoatLengthCode>
<BoatLengthMeasure
unitCode='feet'>31.42</BoatLengthMeasure>
</BoatLengthGroup>
<BoatLengthGroup>
<BoatLengthCode>Length Overall</BoatLengthCode>
<BoatLengthMeasure
unitCode='feet'>45</BoatLengthMeasure>
</BoatLengthGroup>
<BeamMeasure unitCode='feet'>11</BeamMeasure>
<DraftMeasureGroup>
<DraftMeasure unitCode='feet'>6.08</DraftMeasure>
<BoatDraftCode>Max Draft</BoatDraftCode>
</DraftMeasureGroup>
<DisplacementMeasure
unitCode='pound'>25000</DisplacementMeasure>
<BoatCategoryCode>Sail</BoatCategoryCode>
<BoatClassGroup>
<BoatClassCode>Cruisers</BoatClassCode>
<PrimaryBoatClassIndicator>true</PrimaryBoatClassIndicator>
</BoatClassGroup>
<GeneralBoatDescription>One Big
Boat</GeneralBoatDescription>
<BuilderName>Starratt and Jenks</BuilderName>
<DesignerName>Charles Morgan</DesignerName>
<BoatName>Inventions</BoatName>
<Hull>
<BoatHullMaterialCode>Fiberglass</BoatHullMaterialCode>
</Hull>
<BallastWeightMeasure
unitCode='pound'>12500</BallastWeightMeasure>
</VehicleRemarketingBoat>
</dataRecord>
</dataArea>
----- Excess quoted text cut - see Original Post for more -----
Author: Russ Michaels
Short Link: http://www.houseoffusion.com/groups/cf-talk/thread.cfm/threadid:62635#342343
Matt, as you want a non CF solution, you might be best posting elsewhere to
find a database guru who could do it all directly on the DB.
If no-one here pipes up then try www.odesk.com
On Wed, Feb 16, 2011 at 6:09 PM, Rick Faircloth
<Rick@whitestonemedia.com>wrote:
----- Excess quoted text cut - see Original Post for more -----
Author: Rick Faircloth
Short Link: http://www.houseoffusion.com/groups/cf-talk/thread.cfm/threadid:62635#342338
Hi, Matt...
I know you might have looked at this link at dev.mysql.com
for a solution to processing your data feed, but I thought I'd
bring it up, just in case you haven't seen it. Looks like it
might be right up your alley.
http://dev.mysql.com/tech-resources/articles/xml-in-mysql5.1-6.0.html#xml-5.
1-in-and-out
While I didn't have to process xml files, I do have to process
a daily data feed of real estate data in comma-delimited files
and insert that info into a MySQL 5 database.
At first, I tried processing the files with CF, but it was a
*really* slow process using CF, taking almost a minute to process
some files. Finally, someone on this list
suggested I take a look at the MySQL commands for loading data
from files, such as, in your case, "Load_File()", which opens
an entire XML document, makes it available as a string,
and inserts this string into a table column. In my case, using
MySQL's load data infile, the time was reduced to less than a second.
Check out the page above for details on usage and syntax.
hth,
Rick
Hey everybody, we're in a bind here and I pitched the idea to the
client of bringing in a ringer. Is there someone out there who is
interested in building a direct-to-SQL Server or direct-to-mySQL-based
solution?
Solution would have to, on a scheduled basis, grab a file in a
location on the server on its own without CF and process it (we have
stuff that retrieves/places the file). Current file size is about 50
mb but must be expected to go to around 100. At 42mb using just CF I
am at about 70 seconds per record loop. We need a *significant*
improvement in performance. Work would be on a dedicated Windows
server. Present box is a fairly powerful Crystaltech Win2003 Server
w/4GB RAM and a Xeon 2.8 processor w/4 cores. We'd consider a
Win2008, 64-bit installation to improve speed. Looking at a 24-core
server at Viviotech w/16gb as a hi-horsepower alternative.
Our feed partner is probably going to be able to use a different
method of delivery that will reduce file size, but for now we need to
plan for the worst and move on it.
Anyone interested? You can email me at my for-reals email at matt AT
mysecretbase DOT com.
--
--m@Robertson--
Janitor, The Robertson Team
mysecretbase.com
Author: Matt Robertson
Short Link: http://www.houseoffusion.com/groups/cf-talk/thread.cfm/threadid:62635#342336
Hey everybody, we're in a bind here and I pitched the idea to the
client of bringing in a ringer. Is there someone out there who is
interested in building a direct-to-SQL Server or direct-to-mySQL-based
solution?
Solution would have to, on a scheduled basis, grab a file in a
location on the server on its own without CF and process it (we have
stuff that retrieves/places the file). Current file size is about 50
mb but must be expected to go to around 100. At 42mb using just CF I
am at about 70 seconds per record loop. We need a *significant*
improvement in performance. Work would be on a dedicated Windows
server. Present box is a fairly powerful Crystaltech Win2003 Server
w/4GB RAM and a Xeon 2.8 processor w/4 cores. We'd consider a
Win2008, 64-bit installation to improve speed. Looking at a 24-core
server at Viviotech w/16gb as a hi-horsepower alternative.
Our feed partner is probably going to be able to use a different
method of delivery that will reduce file size, but for now we need to
plan for the worst and move on it.
Anyone interested? You can email me at my for-reals email at matt AT
mysecretbase DOT com.
--
--m@Robertson--
Janitor, The Robertson Team
mysecretbase.com
Author: Matt Robertson
Short Link: http://www.houseoffusion.com/groups/cf-talk/thread.cfm/threadid:62635#342231
Very interesting stuff, guys. Very much appreciated. Should have
come here before I did the project as opposed to afterward. You'd
think I'd know better by now.
--
--m@Robertson--
Janitor, The Robertson Team
mysecretbase.com
Author: Russ Michaels
Short Link: http://www.houseoffusion.com/groups/cf-talk/thread.cfm/threadid:62635#342229
That is probably the solution I was thinking of too, I expect Ben or someone
just wrapped it up into a UDF or CFC or posted it on one of the
aforementioned sites.
Russ
----- Excess quoted text cut - see Original Post for more -----
Author: Michael Grant
Short Link: http://www.houseoffusion.com/groups/cf-talk/thread.cfm/threadid:62635#342227
That actually looks like a pretty nice solution if bulkinsert isn't an
option.
----- Excess quoted text cut - see Original Post for more -----
Author: Russ Michaels
Short Link: http://www.houseoffusion.com/groups/cf-talk/thread.cfm/threadid:62635#342226
Processing the XML with Java rather than CFML will speed things up.
check riaforge.org or cflib.org and the adobe exchange, somewhere there is a
XML parser that does just this and may give you the boost you want.
Russ
On Tue, Feb 15, 2011 at 12:15 AM, Casey Dougall <
casey@uberwebsitesolutions.com> wrote:
----- Excess quoted text cut - see Original Post for more -----
Author: Alan Rother
Short Link: http://www.houseoffusion.com/groups/cf-talk/thread.cfm/threadid:62635#342225
When in doubt, ask Ben Nadel(or check his blog)... Words to live by
I had the same issue, huge variable XML file, lots of variations, had a hell
of a time making it work in SQL Server due to the complexity
Used this, works amazingly well, all things considered, executes
very quickly
http://www.bennadel.com/blog/1345-Ask-Ben-Parsing-Very-Large-XML-Documents-In-ColdFusion.htm
=]
On Mon, Feb 14, 2011 at 5:15 PM, Casey Dougall <
casey@uberwebsitesolutions.com> wrote:
----- Excess quoted text cut - see Original Post for more -----
Author: Casey Dougall
Short Link: http://www.houseoffusion.com/groups/cf-talk/thread.cfm/threadid:62635#342223
On Mon, Feb 14, 2011 at 7:07 PM, Matt Robertson
<websitemaker@gmail.com>wrote:
>
> Client is running mySQL, But a SQL Server could probably be arranged
> if we have to. Kind of a painful investment for them, though.
>
>
Well seems there is support for this in mysql as well but examples use
something simular to BulkInsert where the xml file is on the MySql Server.
http://dev.mysql.com/tech-resources/articles/xml-in-mysql5.1-6.0.html
Author: Matt Robertson
Short Link: http://www.houseoffusion.com/groups/cf-talk/thread.cfm/threadid:62635#342221
> Either that, or start looking at something like StaX -
> http://stax.codehaus.org/Home
thx for the tip on Stax but it frankly looks a bit out of my league
for this project.
--
--m@Robertson--
Janitor, The Robertson Team
mysecretbase.com
Author: Michael Grant
Short Link: http://www.houseoffusion.com/groups/cf-talk/thread.cfm/threadid:62635#342222
> > you can make use of BulkInsert
+1. I love love love bulkinsert.
Author: Matt Robertson
Short Link: http://www.houseoffusion.com/groups/cf-talk/thread.cfm/threadid:62635#342220
Client is running mySQL, But a SQL Server could probably be arranged
if we have to. Kind of a painful investment for them, though.
I was kind of hoping you guys would find a bonehead mistake that would
magically clear this up. I'm out of CF-based ideas.
Speaking of which, I tried shortening those var names with some
replace() statements and it had zero effect. Didn't expect it would
but wanted to throw it up against the wall.
Damn shame I can't drop to java and do a line-by-line read like I do
with CSV files :-(
--
--m@Robertson--
Janitor, The Robertson Team
mysecretbase.com
Author: Casey Dougall
Short Link: http://www.houseoffusion.com/groups/cf-talk/thread.cfm/threadid:62635#342218
----- Excess quoted text cut - see Original Post for more -----
yeah, I don't know, still seems better equipped for SQL server if you can do
it there.
<cfstoredproc procedure="spBIG_XML_Insert" datasource="#request.DataSource#"
returncode="Yes">
<cfprocparam type="In" cfsqltype="CF_SQL_VARCHAR" variable="mydoc"
value="#r.SXMLRESPONSE#">
</cfstoredproc>
And then the trimmed down Stored Procedure....
ALTER PROCEDURE [dbo].[spBIG_XML_Insert]
@mydoc xml,
AS
declare @hdoc int
-- Create an internal representation of the XML document.
EXEC sp_xml_preparedocument @hdoc OUTPUT, @mydoc, '<soap:Envelope
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xmlns:xsd="http://www.w3.org/2001/XMLSchema"
xmlns:soap="http://schemas.xmlsoap.org/soap/envelope/"
xmlns:c="urn:Scores/Services/v2/ScoresUpdate" />'
-- Now do the insert
INSERT INTO tblScoreUpdates
SELECT
MemberID = x.MemberID,
ReportDate = x.ReportDate,
Score = x.Score,
DateAdded = GetDate()
FROM OPENXML (@hdoc,
'soap:Envelope/soap:Body/c:LatestScoresResponse/c:parameters/c:Scores',2)
WITH
( MemberID varchar(10) '@MemberID',
ReportDate varchar(25) '@ReportDate',
Score int '@Score') as x
WHERE
x.MemberID NOT IN (SELECT t.MemberID
FROM tblScoreUpdates t With
(NoLock)
WHERE t.MemberID =
x.MemberID)
;
--Do another query here if you want,
--remove the document from memory
EXEC sp_xml_removedocument @hDoc
And if you needed something from a level up... just dot notation back to it.
LogID = int '../@LogID which would grab the logID from Parameters...
Author: Mark Mandel
Short Link: http://www.houseoffusion.com/groups/cf-talk/thread.cfm/threadid:62635#342216
Either that, or start looking at something like StaX -
http://stax.codehaus.org/Home
Pulling that large a file in memory is going to suck a lot, no matter
what you do.
Mark
----- Excess quoted text cut - see Original Post for more -----
Author: Casey Dougall
Short Link: http://www.houseoffusion.com/groups/cf-talk/thread.cfm/threadid:62635#342215
On Mon, Feb 14, 2011 at 5:44 PM, Matt Robertson
<websitemaker@gmail.com>wrote:
----- Excess quoted text cut - see Original Post for more -----
Are you using MS SQL 2005? Because throwing the XML file directly at SQL
server works a 1000% better. I've basically given up on processing large XML
files directly with coldfusion. IF you can save these xml files on the SQL
server you can make use of BulkInsert but that needs to read the file from
that server. OR just setup a stored procedure, send your xml over and let it
do it's thing.
http://msdn.microsoft.com/en-us/library/ms345117%28v=SQL.90%29.aspx
Author: Matt Robertson
Short Link: http://www.houseoffusion.com/groups/cf-talk/thread.cfm/threadid:62635#342213
My question is how do I reduce processing time... is there anything I
could be doing better?
I am tasked with doing an import on a file whose size can essentially
be unlimited. We've been able to handle in the 10-15mb range but it
recently ballooned to 100 mb, and its going to get larger. Processing
time seems to be about 66 hours for a 45 mb file and thats a disaster.
For a 14mb file its about 90 minutes.
Whats happening is this: CF is looping over a very large number of
records that are in themselves fairly complex. The more records there
are, the longer the loop over them takes. More records + more time
per record = a seemingly geometric increase in processing time.
The file is read in like so.
<!---
read the imported file
--->
<cflock
name="mates_import"
type="EXCLUSIVE"
timeout="10">
<cffile
action="READ"
file="#variables.mates.srcFile#"
variable="x">
</cflock>
<cfset x =
replaceNoCase(x,"StateOrProvinceCountrySub-DivisionID","StateProvince","ALL")>
<!---
turn the file into a coldfusion xml object
--->
<cfset x=ltrim(trim(x))>
<cfset x=XMLParse(x)>
The above takes only a few seconds. No problem there.
Next I have to read in some header info
<cfscript>:
header.logicalID=x.processvehicleremarketing.applicationarea.sender.logicalID.xmltext;
header.taskID=x.processvehicleremarketing.applicationarea.sender.taskID.xmltext;
header.BODID=x.processvehicleremarketing.applicationarea.BODID.xmltext;
header.created=x.processvehicleremarketing.applicationarea.CreationDateTime.xmltext;
// ...
// and here comes the node with all of the line items in it I'll have
to loop over. This is where all of the speed issues have been traced
to:
variables.mates.boatArrayLen=arrayLen(x.processvehicleremarketing.ProcessVehicleRemarketingDataArea.VehicleRemarketing);
</cfscript>
knowing the array length I can use CFLOOP to look over it and pull
data in where it is then stored in a db.
<cfloop
from="1"
to="#variables.mates.boatArrayLen#"
index="i">
<cfscript>
listings_mates.inHouseListingNumber="M-" &
x.processVehicleRemarketing.processVehicleRemarketingDataArea.vehicleRemarketing[i].vehicleRemarketingHeader.documentIdentificationGroup.documentIdentification.documentID.xmltext;
listings_mates.price=x.processVehicleRemarketing.processVehicleRemarketingDataArea.vehicleRemarketing[i].vehicleRemarketingBoatLineItem.pricingABIE.price.chargeAmount.xmltext;
listings_mates.currency=x.processVehicleRemarketing.processVehicleRemarketingDataArea.vehicleRemarketing[i].vehicleRemarketingBoatLineItem.pricingABIE.price.chargeAmount.xmlAttributes.currencyID;
// there can be more than one of these items so run a loop inside
of the loop
variables.mates.engineArrayLen=arrayLen(x.processVehicleRemarketing.processVehicleRemarketingDataArea.vehicleRemarketing[i].vehicleRemarketingBoatLineItem.VehicleRemarketingEngineLineItem);
ii=0;
do {
ii=ii+1;
listings_mates.engineDesc=x.processVehicleRemarketing.processVehicleRemarketingDataArea.vehicleRemarketing[i].vehicleRemarketingBoatLineItem.VehicleRemarketingEngineLineItem[ii].VehicleRemarketingEngine.modelDescription.xmltext;
listings_mates.engHrs=x.processVehicleRemarketing.processVehicleRemarketingDataArea.vehicleRemarketing[i].vehicleRemarketingBoatLineItem.VehicleRemarketingEngineLineItem[ii].VehicleRemarketingEngine.totalEngineHoursNumeric.xmltext;
} while (ii LTE variables.mates.engineArrayLen);
...
</cfscript>
...
</cfloop>
And so on. A hundred or so fields and a dozen or so loops inside the
main loop, along with a loop or two inside of those. So the very long
variable names get even longer.
As you can see I am pouring the data into a struct, and when done, I
insert it as a db record.
Anyone see a mistake in my methods? Would things speed up if, before
I read it into an xml object, I ran a replace() or three to shorten up
some of those names?
--
--m@Robertson--
Janitor, The Robertson Team
mysecretbase.com
|
May 24, 2012
|
Latest Fusion Authority Articles
|
|||||||||||||||||||||||||||||||||||||||||||||||||||||||