Re: Count unique values from overlapping values On Mar 12, 2:47 pm, Jerry Stuckle <jstuck...@attglobal.net> wrote:
> Kim wrote:
> > On Mar 12, 12:18 pm, "Rik Wasmus" <luiheidsgoe...@hotmail.com> wrote:
> >> On Wed, 12 Mar 2008 10:57:54 +0100, Kim <kims...@gmail.com> wrote:
> >>> On Mar 11, 3:17 pm, "Rik Wasmus" <luiheidsgoe...@hotmail.com> wrote:
> >>>> On Tue, 11 Mar 2008 14:55:10 +0100, Kim <kims...@gmail.com> wrote:
> >>>>> On Mar 11, 2:11 pm, "Rik Wasmus" <luiheidsgoe...@hotmail.com> wrote:
> >>>>>> On Tue, 11 Mar 2008 13:57:37 +0100, Kim <kims...@gmail.com> wrote:
> >>>>>>> On Mar 11, 11:25 am, "Rik Wasmus" <luiheidsgoe...@hotmail.com>
> >>>> wrote:
> >>>>>>>> On Tue, 11 Mar 2008 11:20:06 +0100, Kim <kims...@gmail.com> wrote:
> >>>>>>>>> Sample data:
> >>>>>>>>> colA,colB,sum
> >>>>>>>>> 33996344,33996351,8
> >>>>>>>>> 50331648,67276831,16945184
> >>>>>>>>> 50331648,68257567,17925920
> >>>>>>>>> 67276832,67276847,16
> >>>>>>>>> 67276848,67277023,176
> >>>>>>>>> 67277024,67277031,8
> >>>>>>>>> Wanted output:
> >>>>>>>>> 33996344,33996351,8
> >>>>>>>>> 50331648,68257567,17925920
> >>>>>>>>> Sum is done as follows (colB-colA+1).
> >>>>>>>>> Any idea how to get this outcome ?
> >>>>>>>>> This will give incorrect unique count, because it simply sums
> >>>> all
> >>>>>> up.
> >>>>>>>>> SELECT sum(colB-colA+1) FROM tbl
> >>>>>>>> If this is not what you want:
> >>>>>>>> SELECT sum(colB-colA+1) FROM tbl
> >>>>>>>> GROUP BY colA,colB
> >>>>>>>> ... you'll have to explain the logic further, as I can't really
> >>>> see
> >>>>>> it.
> >>>>>>>> What overlaps what, how, an how should that be handled?
> >>>>>>> To Rik Wasmus:
> >>>>>>> Its not what I want.
> >>>>>>> If you look at the sample when you should see 6 rows. Of these only
> >>>>>>> the 2 listed in "wanted output" is unique.
> >>>>>>> Here is why:
> >>>>>>> Row 1 is unique already, so no filtering needed.
> >>>>>>> Row 2 is not unique because row 3 contains the same + more and
> >>>>>>> therefor needs to be filtered out.
> >>>>>>> Row 3 is unique because it has the largest range with no
> >>>> overlapping.
> >>>>>>> Row 4-6 are all in the range which row 3 has and therefor needs to
> >>>> be
> >>>>>>> filtered out.
> >>>>>> So, if I understand correctly, you want the 'largest' range, and skip
> >>>>>> all
> >>>>>> records that fall in a bigger range?
> >>>>>> If that's the case, what should happen with:
> >>>>>> 1 4
> >>>>>> 2 6
> >>>>>> 1 3
> >>>>>> 8 12
> >>>>>> 9 10
> >>>>>> The problem here is row 2 falls partly in row 1, but I assume:
> >>>>>> 1 4
> >>>>>> 2 6
> >>>>>> 8 12
> >>>>>> If I assume these should count as different rows, I'd say this one:
> >>>>>> SELECT
> >>>>>> a.colA, a.colB, (a.colB-a.colA+1)
> >>>>>> FROM tablename a
> >>>>>> LEFT JOIN tablename b
> >>>>>> ON b.colA <= a.colA
> >>>>>> AND b.colA >= a.colA
> >>>>>> # having a pk would help
> >>>>>> # now we have to make sure it
> >>>>>> # doesn't match on it's own row:
> >>>>>> AND NOT (a.colA = b.colA AND a.colB = b.colB)
> >>>>>> WHERE b.colA IS NULL
> >>>>> Your understanding is correct, as is your assumption.
> >>>>> I had not yet considered the way you outline of a range overlapping
> >>>>> another range. I now see dark future of doing this without looping...
> >>>>> Regarding your query on sample data, this is what I get:
> >>>>> 33996344,33996351,8
> >>>>> 67276832,67276847,16
> >>>>> 67276848,67277023,176
> >>>>> 67277024,67277031,8
> >>>> Hmmz, just an error in typing it up, the second JOIN clause should
> >>>> offcourse be on colB:
> >>>> mysql> SELECT
> >>>> -> a.colA, a.colB, (a.colB-a.colA+1)
> >>>> -> FROM tester a
> >>>> -> LEFT JOIN tester b
> >>>> -> ON b.colA <= a.colA
> >>>> -> AND b.colB >= a.colB
> >>>> -> # having a pk would help
> >>>> -> # now we have to make sure it
> >>>> -> # doesn't match on it's own row:
> >>>> -> AND NOT (a.colA = b.colA AND a.colB = b.colB)
> >>>> -> WHERE b.colA IS NULL;
> >>>> +----------+----------+-------------------+
> >>>> | colA | colB | (a.colB-a.colA+1) |
> >>>> +----------+----------+-------------------+
> >>>> | 33996344 | 33996351 | 8 |
> >>>> | 50331648 | 68257567 | 17925920 |
> >>>> +----------+----------+-------------------+
> >>>> 2 rows in set (0.00 sec)
> >>> Yours actually work faster than Luuk's.
> >>> However the time it takes to sum up unique values on 10.000 rows takes
> >>> 94sec, which isnt fast in any way. I dare not think of how much time
> >>> it will take for millions of rows.
> >> Could you run it with an EXPLAIN before the SELECT and give us the output?
>
> >>> SELECT
> >>> sum(a.colB-a.colA+1)
> >>> FROM tbl a
> >>> LEFT JOIN tbl b
> >>> ON b.colA <= a.colA
> >>> AND b.colB >= a.colB
> >>> AND NOT (a.colA = b.colA AND a.colB = b.colB)
> >>> WHERE b.colA IS NULL
> >>> Why is the where clause there ? If its avoid NULL values, then its not
> >>> a problem since there are none in tbl.
> >> The WHERE clause is exactly there and works because colA is never NULL 
> >> Translated into human readable language this query is something like:
>
> >> Give the result of sum(a.colB-a.colA+1) from tbl, where there are no rows
> >> (colA IS NULL) in tbl where colA-colB falls into the range of this row
> >> (the join clause).
>
> >>> Therefor am I trying the SP way I described in my last post.
> >> One query with the right indexes (I sure hope you have some...) is often
> >> way faster then an SP.
> >> --
> >> Rik Wasmus
>
> >> Could you run it with an EXPLAIN before the SELECT and give us the output?
> > id select_type table type possible_keys key key_len ref rows
> > Extra
> > 1 SIMPLE a ALL NULL NULL NULL NULL 10000
> > 1 SIMPLE b ALL NULL NULL NULL NULL 10000 Using where; Not
> > exists
>
> >> One query with the right indexes (I sure hope you have some...) is often
> >> way faster then an SP.
> > There is only 1 normal index in the tbl, and its on colC (containing
> > an ID).
> > Upon creation of tbl it is sorted by colA, colB anyway. Adding any
> > sort of index on colA and/or colB will in fact make queries slower as
> > much as 500% ! So I have tried more indexes.
>
> Wrong. Tables are NEVER sorted. Never assume that rows will be in the
> order they were inserted.
>
> Currently you are doing a table scan. I would suggest an index,
> probably on colA, but you could try with colB and explain both ways.
>
> > Still working on the SP, but still with errors.
> > Thanks for the help
>
> --
> ==================
> Remove the "x" from my email address
> Jerry Stuckle
> JDS Computer Training Corp.
> jstuck...@attglobal.net
> ==================
> Wrong. Tables are NEVER sorted. Never assume that rows will be in the
> order they were inserted.
You misunderstand. I dont assume it is sorted - I know it is. Because
when the table tbl is created the data is sorted on insertion. |