|
Mailing Lists
|
Home /
Groups /
ColdFusion Talk (CF-Talk)
Checking for duplicates in list being imported
I've got an application that a client uses to import email list fromLes Mizzell 05/22/08 04:12 P You you using a data loader native to the db, or are you inserting theSonny Savage 05/23/08 09:28 A I had to deal with this same issue some time back...Scott Weikert 05/23/08 09:51 A Depending on what database you are using, the database may be able to handle this quite easily.Jeff Price 05/23/08 11:02 A I've got an application that a client uses to import email list from Excel spreadsheets into a database for use to send e-newsletters to specific groups. The database has grown quite large, and when importing list to specific group at this point, they're forgetting if a particular email address had already been added, and needs something that prevents duplicates from being added to the list. I can think of a couple of clumsy (resource intensive) ways to check the addresses being imported against what's already in the database, but there must be an optimal way to do it. <cf_check_duplicates_before_insert> ?? Suggestions? You you using a data loader native to the db, or are you inserting the records one at a time? Native loaders generally have a way to prevent duplicates. If you're inserting one at a time, just do a query first to see if it exists. You'll definitely want an index on that column, possibly even a unique index. On Thu, May 22, 2008 at 4:10 PM, Les Mizzell <lesmizz@bellsouth.net> wrote: ----- Excess quoted text cut - see Original Post for more ----- I had to deal with this same issue some time back... Here's what I do. First, I create a temporary table for the import, and do a dump into that, straight up. I create an identity field so that every record has a unique integer value to use later. I run updates on the fields in that table to do a little data cleanup - trimming off bounding quotes, extra spaces or odd invisible characters on the ends, etc. Then, I'll run a query to pull out just the records with duplicate emails. <cfquery name="qryDupeList"> SELECT Email, ImportID FROM (import table) GROUP BY Email, ImportID HAVING COUNT(Email) > 1 ORDER BY Email </cfquery. I then loop over this query, skipping the first occurance of each email addy, and tucking the remaining matches' ImportIDs into a list. <cfset dupeImportIDList = ""> <cfset curEmail = ""> <cfloop query="qryDupeList"> <cfif Email IS NOT curEmail> <cfset curEmail = Email> <cfelse> <cfset dupeImportIDList = ListAppend(dupeImportIDList, ImportID)> </cfif> </cfloop> If there are any dupes, I just delete all records with those ImportIDs. <cfif ListLen(dupeImportIDList)> <cfquery> DELETE FROM (import table) WHERE ImportID IN (#dupeImportIDList#) </cfquery> </cfif> The only real trick with this is, the duplicate records deleted are kinda random. So if you have other data associated with those email addresses, and that data varies from record to record, you're basically flipping coins to see which one you end up with. You could choose to display the data on a page for the user to pick and choose which record out of the duped records they want to keep, naturally. Depending on what database you are using, the database may be able to handle this quite easily. MS SQL Server can handle this via a unique index. Take a look at INGORE_DUP_KEY. I don't know if other DB's have access to something like this, but I'd have to think most of them do. -jeff ----- Excess quoted text cut - see Original Post for more -----
|
May 24, 2012
|
Latest Fusion Authority Articles
|
|||||||||||||||||||||||||||||||||||||||||||||||||||||||