This is a discussion on identify duplicate addresses? within the MySQL forums, part of the Database Server Software category.
| >> Euhm, in Holland a _postal code_ + house number is unique. Is this not
>> true for the US/UK/wherever you're from?
>
>I'm from the US. With the old postal code, that's not true. However,

A 5-digit ZIP code could cover an entire small suburban city. (I think Benbrook, Texas = 76126 is an example). Then again, the Reader's Digest Sweepstakes "NO" response got its own 5-digit ZIP code.

>the USPS introduced extra zip code digits that do make a unique
>address, but not everyone is aware of them.

I don't think so. 9-digit zip codes get down to a few blocks, but there is as far as I know no restriction that a single 9-digit zip code couldn't cover the 800-999 houses on Oak Street, the 800-999 houses on Elm Street and the 800-999 houses on Chestnut Street. 6 blocks, probable duplicate house numbers.

My 9-digit zip code for a post office box was unique (5-digit zip for the boxes as a whole, followed by box number).

I don't think 9 digits gives enough codes, given houses, post office boxes, and the business district all need them.

>I think that the extended
>zip code by itself identifies a unique address. But I can't rely on my
>users knowing it. |
| |||
| On Tue, 20 Nov 2007 04:00:00 -0000, Gordon Burditt wrote:
>>> Euhm, in Holland a _postal code_ + house number is unique. Is this not
>>> true for the US/UK/wherever you're from?
>>
>>I'm from the US. With the old postal code, that's not true. However,
>
> A 5-digit ZIP code could cover an entire small suburban city. (I
> think Benbrook, Texas = 76126 is an example). Then again, the
> Reader's Digest Sweepstakes "NO" response got its own 5-digit ZIP
> code.
>
>>the USPS introduced extra zip code digits that do make a unique
>>address, but not everyone is aware of them.
>
> I don't think so. 9-digit zip codes get down to a few blocks, but
> there is as far as I know no restriction that a single 9-digit zip
> code couldn't cover the 800-999 houses on Oak Street, the 800-999 houses
> on Elm Street and the 800-999 houses on Chestnut Street. 6 blocks,
> probable duplicate house numbers.

In most residential cases, a 9-digit ZIP code (nominally ZIP+4) covers a single large multi-residence building or a single side of a residential street block, such as 1801-1899 (odd) Oak St. That means ZIP+4 combined with a house number or apartment number generally identifies a specific household.

HOWEVER, for the situation the OP was describing, that would depend on users knowing their ZIP+4 and transcribing it correctly, OR on doing address cleansing well enough to match against a USPS-supplied ZIP+4 data file, and you're right back to very nearly CASS Certification again.

> My 9-digit zip code for a post office box was unique (5-digit zip for
> the boxes as a whole, followed by box number).
>
> I don't think 9 digits gives enough codes, given houses, post office
> boxes, and the business district all need them.

I dunno... You're dealing with a BILLION codes in ZIP+4. That's enough to give everyone in the country two of them and still have lots left over.
-- For their next act, they'll no doubt be buying a firewall running under NT, which makes about as much sense as building a prison out of meringue. -- Tanuki |
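The "billion codes" arithmetic is easy to check. Here is a rough sketch in Python; the population figure is my own round assumption (about 300 million), not a number from the thread, and the total treats every 9-digit string as usable, which overstates the real supply because not every 5-digit prefix is assigned.

```python
# Rough capacity check for ZIP+4.
# Assumption: every 9-digit string is a usable code (an upper bound).
ZIP_PLUS_4_DIGITS = 9
total_codes = 10 ** ZIP_PLUS_4_DIGITS

# Assumption: approximate US population, rounded; not taken from the thread.
us_population = 300_000_000
codes_per_person = 2

# Codes consumed if every person got two, and what remains.
used = us_population * codes_per_person
left_over = total_codes - used

print(f"{total_codes:,} possible codes")
print(f"{used:,} needed for {codes_per_person} per person")
print(f"{left_over:,} left over")
```

So the raw count is not the limit. How codes are actually allocated to blocks, buildings, and boxes is a separate question that this does not answer.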
| |||
| On Nov 19, 12:15 pm, Jerry Stuckle <jstuck...@attglobal.net> wrote:
>
> And what do you do with something like 123 E St. NW?

Another thought:

I have two fields for the address. One for the original user input, and one that is standardized for comparison.

If a user enters "100 East E St", that's what goes in the first field. The second field gets "100 East East Street".

The second field is used to find duplicates, while the first is used for mailings. The actual street name is "E", but for comparison purposes, if I convert all "E" street addresses to "East", it doesn't matter, because they all still match.

As long as two different people don't live at "100 East East Street" and "100 East E Street" (we would consider those two people to be a single cheater) then the above should work. |
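The two-field idea in this post could be sketched roughly as follows. This is only an illustration in Python, not the poster's actual code: the `EXPANSIONS` table, the function names, and the field names `address_entered` and `address_compare` are all hypothetical, and the comparison value is lowercased here for convenience.

```python
import re

# Hypothetical lookup table for illustration; a real one needs many more
# entries, and some tokens are ambiguous ("st" can also mean "saint").
EXPANSIONS = {
    "e": "east", "w": "west", "n": "north", "s": "south",
    "st": "street", "ave": "avenue", "blvd": "boulevard",
}

def comparison_form(raw_address: str) -> str:
    """Return a standardized copy of the address, used only for matching."""
    # Lowercase, drop periods and commas, then split on whitespace.
    tokens = re.sub(r"[.,]", "", raw_address.lower()).split()
    # Expand every known abbreviation, including a bare street name "E".
    return " ".join(EXPANSIONS.get(tok, tok) for tok in tokens)

def make_record(raw_address: str) -> dict:
    """Keep the user's input for mailing and the standardized form for comparison."""
    return {
        "address_entered": raw_address,
        "address_compare": comparison_form(raw_address),
    }

record = make_record("100 East E St")
print(record["address_entered"])   # what the user typed, used for mailings
print(record["address_compare"])   # standardized form, used to find duplicates
```

With this sketch, "100 East E St" and "100 East East Street" both produce the same comparison value, which is exactly the collision the post accepts as a trade-off.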
| |||
| On Tue, 20 Nov 2007 20:42:36 +0100, <lawpoop@gmail.com> wrote:
> On Nov 19, 12:15 pm, Jerry Stuckle <jstuck...@attglobal.net> wrote:
>>
>> And what do you do with something like 123 E St. NW?
>
> Another thought:
>
> I have two fields for the address. One for the original user input,
> and one that is standardized for comparison.
>
> If a user enters "100 East E St", that's what goes in the first field.
> The second field gets "100 East East Street".
>
> The second field is used to find duplicates, while the first is used
> for mailings. The actual street name is "E", but for comparison
> purposes, if I convert all "E" street addresses to "East", it doesn't
> matter, because they all still match.
>
> As long as two different people don't live at "100 East East Street"
> and "100 East E Street" ( we would consider those two people to be a
> single cheater ) then the above should work.

The last is a dangerous assumption. Why wouldn't E Street be close to East Street?

--
Rik Wasmus |
| |||
| On Nov 20, 2:02 pm, "Rik Wasmus" <luiheidsgoe...@hotmail.com> wrote:
> > Another thought:
>
> > I have two fields for the address. One for the original user input,
> > and one that is standardized for comparison.
>
> > If a user enters "100 East E St", that's what goes in the first field.
> > The second field gets "100 East East Street".
>
> > The second field is used to find duplicates, while the first is used
> > for mailings. The actual street name is "E", but for comparison
> > purposes, if I convert all "E" street addresses to "East", it doesn't
> > matter, because they all still match.
>
> > As long as two different people don't live at "100 East East Street"
> > and "100 East E Street" ( we would consider those two people to be a
> > single cheater ) then the above should work.
>
> The last is a dangerous assumption. Why wouldn't E Street be close to East
> Street?

What do you mean by 'close'? Do you mean literally near one another, geographically? As far as address matching goes, we only care whether or not they exist in the same zip code.

They could be close, but what are the chances that those two streets exist in the same zip code, and that a person from each street living at the same number fills out a survey in the same time period? I don't necessarily need a bulletproof system; this one probably works 99.9% of the time. |
| |||
| >> > As long as two different people don't live at "100 East East Street"
>> > and "100 East E Street" ( we would consider those two people to be a
>> > single cheater ) then the above should work.
>>
>> The last is a dangerous assumption. Why wouldn't E Street be close to East
>> Street?
>
>What do you mean by 'close'? Do you mean literally near one another,
>geographically? As far as address matching is concerned, we are only
>concerned whether or not they exist in the same zip code.

Well, if you're talking about the same 9-digit zip code, they would normally be pretty close.

>They could be close, but what are the chances that those two streets
>exist in the same zip code, and a person from each street living at
>the same number fills out a survey in the same time period? I don't
>necessarily need a bullet proof system; this one probably works 99.9%
>of the time.

Remember that what you're doing invites malicious input. If someone discovers that they can get a different address just by changing the lower digit or two of the ZIP code, they'll do it. The same also may apply to 1201 Elm Street, 1201 Elm Avenue, 1201 Elm Boulevard, etc. Or these might really be different addresses. |
| |||
| Here is the solution I am choosing to go with:

1. Keep the user-entered mailing address for mailing.
2. Abbreviate the address completely in another column. For instance, "123 East East Street Northwest" becomes "123 e e st nw" (also set the street to lowercase and strip out periods so that they don't trip up comparisons).
3. Use the abbreviated address for duplicate checking.
4. Use the user-entered address for actual mailings.

I think this should be good enough for cheaters trying to 'hack' the system and increase their odds when they have an ambiguous address.

This doesn't prevent a single-occupant house from becoming an apartment building, or a suite of offices, however. |
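The steps above can be sketched in a few lines of Python. Treat this as a minimal illustration under stated assumptions rather than the poster's implementation: the `ABBREVIATIONS` table, the function names, and the sample entries (including the ZIP code) are made up for the example, and matching is done per ZIP code because that is the scope the thread describes.

```python
from collections import defaultdict

# Hypothetical abbreviation table; a production list would be much longer.
ABBREVIATIONS = {
    "east": "e", "west": "w", "north": "n", "south": "s",
    "northeast": "ne", "northwest": "nw",
    "southeast": "se", "southwest": "sw",
    "street": "st", "avenue": "ave", "boulevard": "blvd",
}

def abbreviate_address(raw: str) -> str:
    """Lowercase, strip periods, and abbreviate every known word (step 2)."""
    cleaned = raw.lower().replace(".", "")
    # split() with no arguments also discards leading and repeated spaces.
    return " ".join(ABBREVIATIONS.get(tok, tok) for tok in cleaned.split())

def find_duplicates(entries):
    """Group entry ids by (zip code, abbreviated address) (step 3).

    `entries` is an iterable of (entry_id, zip_code, raw_address) tuples.
    Only groups containing more than one entry are returned.
    """
    groups = defaultdict(list)
    for entry_id, zip_code, raw in entries:
        groups[(zip_code, abbreviate_address(raw))].append(entry_id)
    return {key: ids for key, ids in groups.items() if len(ids) > 1}

# Made-up sample data: entries 1 and 2 are the same address written two ways.
entries = [
    (1, "20001", "123 East E Street NW"),
    (2, "20001", "123 E. E St. N.W."),
    (3, "20001", "125 E St NW"),
]
duplicates = find_duplicates(entries)
print(duplicates)
```

The raw address is never modified, so it stays available for mailing (steps 1 and 4). As the post notes, this cannot tell a real apartment building from a cheater inventing unit numbers, and it will not merge "Elm Street" with "Elm Avenue".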
| ||||
| On Nov 20, 6:50 pm, "Rik Wasmus" <luiheidsgoe...@hotmail.com> wrote:
> On Tue, 20 Nov 2007 21:28:36 +0100, <lawp...@gmail.com> wrote:
> > On Nov 20, 2:02 pm, "Rik Wasmus" <luiheidsgoe...@hotmail.com> wrote:
> >> > Another thought:
>
> >> > I have two fields for the address. One for the original user input,
> >> > and one that is standardized for comparison.
>
> >> > If a user enters "100 East E St", that's what goes in the first field.
> >> > The second field gets "100 East East Street".
>
> >> > The second field is used to find duplicates, while the first is used
> >> > for mailings. The actual street name is "E", but for comparison
> >> > purposes, if I convert all "E" street addresses to "East", it doesn't
> >> > matter, because they all still match.
>
> >> > As long as two different people don't live at "100 East East Street"
> >> > and "100 East E Street" ( we would consider those two people to be a
> >> > single cheater ) then the above should work.
>
> >> The last is a dangerous assumption. Why wouldn't E Street be close to
> >> East Street?
>
> > What do you mean by 'close'? Do you mean literally near one another,
> > geographically? As far as address matching is concerned, we are only
> > concerned whether or not they exist in the same zip code.
>
> And as far as I can gather, there's no reason an E street could not exist
> in the same postal code as East street.
>
> > They could be close, but what are the chances that those two streets
> > exist in the same zip code, and a person from each street living at
> > the same number fills out a survey in the same time period? I don't
> > necessarily need a bullet proof system; this one probably works 99.9%
> > of the time.
>
> Don't misinterpret the law of averages.

I don't think street names are randomly distributed, nor are users filling out a survey on our website a random sample.

> Even with limited use, there's
> still a possibility you encounter this problem. A lot of things have been
> broken because programmers/designers just 'didn't think that would
> happen'. If it can happen, be prepared.

That's true, and in the case of a memory error, that would be a serious problem -- program crash, data loss.

In this situation, it seems to be a simple lessening of risk. The more duplicates I can accurately identify, the less fraud we have. The more false duplicates I find, the fewer honest entrants have a chance of winning.

In other words, when my duplicate-checking routine fails, the impact isn't so great. If I'm not so good at detecting cheaters, a cheater slightly increases their odds. If my algorithm flags someone as a cheater when they actually aren't cheating, that person is unfairly thrown out of the contest.

It's literally impossible to imagine every scenario and prepare for it. You would be developing and testing your system forever. Every engineering project is a trade-off of features, testing, and investment. I can't spend all my time trying to imagine yet another scenario where my code breaks down.

This is a small project. It's been discussed pretty thoroughly on this list, some pretty smart people brought up some issues that I wasn't aware of, and I think that I have a system that works well enough. |