This is a discussion on Keyword Density/Count within the SQL Server forums, part of the Microsoft SQL Server category; --> Hi All, let me try to explain what I'm trying to accomplish - I really hope someone can help. ...
| |||||||
| FAQ | Members List | Calendar | Search | Today's Posts | Mark Forums Read |
| ||||
| Hi All, let me try to explain what I'm trying to accomplish - I really hope someone can help. I have a table (tblArticles) which has the following: vcrKeywords varchar(2000) txtBody text(8000) vcrType varchar(128) and this is a sample of the data vcrKeyWords || txtBody || vcrType key1,key2,key4,key7,key9 || <snipped body> || Site5 News key1,key3,key6,key8,key9 || <snipped body> || Site5 News key1,key3,key4,key5,key9 || <snipped body> || Site5 News key1,key2,key5,key7,key8 || <snipped body> || Site5 News What I'm trying to accomplish is to return a keyword count based on the content of vcrKeywords (i.e. each comma seperated entry as a count. My SQL statement originally was : SELECT vcrKeyWords, COUNT(vcrKeyWords) AS keycount FROM tblArticles where vcrType LIKE 'site 5%' GROUP BY vcrKeyWords ORDER BY keycount DESC However, this simply matches the entire vcrKeyword Column and not each comma seperated value. Is there a way in SQL that I can achieve this or do I need to use some kind of scripting language to accomplish it...? Regards, Carl. |
| |||
| Carl, It would take you more time than it's worth to develop a procedure to search for keywords on the fly in that mess. And once you do create it, it's going to be really slow because it's not going to be able to use any indexes. Let's look at a redesign. Looks to me like you have a many to many relationship between Keywords and Articles. tbl_keywords: keyword_id (unique-idx) keyword_name (pk) tbl_article: article_id (unique-idx) txtBody vcrType tbl_art_key: keyword_id (Clust-PK) article_id (Clust-PK) Now, for your query above with the new design: SELECT tk.keyword_name, count(tak.article_id) FROM tbl_keyword tk JOIN tbl_art_key tak ON tk.keyword_id = tak.keyword_id Cheers, Jason Lepack On Apr 16, 5:15 am, "news.demon.co.uk" <c...@ocvision.com> wrote: > Hi All, > > let me try to explain what I'm trying to accomplish - I really hope someone > can help. > > I have a table (tblArticles) which has the following: > > vcrKeywords varchar(2000) > txtBody text(8000) > vcrType varchar(128) > > and this is a sample of the data > > vcrKeyWords || txtBody || vcrType > key1,key2,key4,key7,key9 || <snipped body> || Site5 News > key1,key3,key6,key8,key9 || <snipped body> || Site5 News > key1,key3,key4,key5,key9 || <snipped body> || Site5 News > key1,key2,key5,key7,key8 || <snipped body> || Site5 News > > What I'm trying to accomplish is to return a keyword count based on the > content of vcrKeywords (i.e. each comma seperated entry as a count. > > My SQL statement originally was : > > SELECT vcrKeyWords, COUNT(vcrKeyWords) AS keycount FROM tblArticles > where vcrType LIKE 'site 5%' GROUP BY vcrKeyWords ORDER BY keycount DESC > > However, this simply matches the entire vcrKeyword Column and not each comma > seperated value. > > Is there a way in SQL that I can achieve this or do I need to use some kind > of scripting language to accomplish it...? > > Regards, > > Carl. |
| |||
| Ah, the joys of not proofreading... Now, for your query above with the new design: SELECT tk.keyword_name, count(tak.article_id) keycount FROM tbl_keyword tk JOIN tbl_art_key tak ON tk.keyword_id = tak.keyword_id JOIN tbl_article ta ON ta.article_id = tak.article_id WHERE ta.vcrType LIKE 'site 5%' GROUP BY tk.keyword_name ORDER BY keycount DESC On Apr 16, 8:41 am, "Jason Lepack" <jlep...@gmail.com> wrote: > Carl, > > It would take you more time than it's worth to develop a procedure to > search for keywords on the fly in that mess. And once you do create > it, it's going to be really slow because it's not going to be able to > use any indexes. Let's look at a redesign. > > Looks to me like you have a many to many relationship between Keywords > and Articles. > > tbl_keywords: > keyword_id (unique-idx) > keyword_name (pk) > > tbl_article: > article_id (unique-idx) > txtBody > vcrType > > tbl_art_key: > keyword_id (Clust-PK) > article_id (Clust-PK) > > Now, for your query above with the new design: > SELECT tk.keyword_name, count(tak.article_id) > FROM tbl_keyword tk JOIN tbl_art_key tak > ON tk.keyword_id = tak.keyword_id > > Cheers, > Jason Lepack > > On Apr 16, 5:15 am, "news.demon.co.uk" <c...@ocvision.com> wrote: > > > > > Hi All, > > > let me try to explain what I'm trying to accomplish - I really hope someone > > can help. > > > I have a table (tblArticles) which has the following: > > > vcrKeywords varchar(2000) > > txtBody text(8000) > > vcrType varchar(128) > > > and this is a sample of the data > > > vcrKeyWords || txtBody || vcrType > > key1,key2,key4,key7,key9 || <snipped body> || Site5 News > > key1,key3,key6,key8,key9 || <snipped body> || Site5 News > > key1,key3,key4,key5,key9 || <snipped body> || Site5 News > > key1,key2,key5,key7,key8 || <snipped body> || Site5 News > > > What I'm trying to accomplish is to return a keyword count based on the > > content of vcrKeywords (i.e. each comma seperated entry as a count. > > > My SQL statement originally was : > > > SELECT vcrKeyWords, COUNT(vcrKeyWords) AS keycount FROM tblArticles > > where vcrType LIKE 'site 5%' GROUP BY vcrKeyWords ORDER BY keycount DESC > > > However, this simply matches the entire vcrKeyword Column and not each comma > > seperated value. > > > Is there a way in SQL that I can achieve this or do I need to use some kind > > of scripting language to accomplish it...? > > > Regards, > > > Carl.- Hide quoted text - > > - Show quoted text - |
| |||
| I agree with Jason that normalizing the design of the table for keywords will be most beneficial. It adds so much power to what you can do with those keywords. A couple notes that may help too, especially if you are not in a position to change table design: - Scripting and client side languages have a very good support to deal with lists and arrays. Most of them implement some sort of "split" function where passing a list and delimiter as parameters will give you an array that will be much easier to deal with. Plus they do it fast. - Erland Sommarskog has excellent articles on Arrays and Lists in SQL Server that can help you to normalize the existing data and use it more efficiently: http://www.sommarskog.se/arrays-in-sql-2005.html http://www.sommarskog.se/arrays-in-sql-2000.html - If you just need the count of keywords and your keyword delimiter is always a comma, then you can write something like this to get the count: SELECT vcrKeywords, CASE WHEN LEN(vcrKeyWords) = 0 THEN 0 ELSE LEN(vcrKeyWords) - LEN(REPLACE(vcrKeyWords, ',', '')) + 1 END AS keycount FROM tblArticles WHERE vcrType LIKE 'site 5%' GROUP BY vcrKeyWords ORDER BY keycount DESC Not sure if your keywords column is a blank string or NULL when there are no keywords associated, so you can tune up the first CASE condition accordingly to handle the case when there are 0 keywords. HTH, Plamen Ratchev http://www.SQLStudio.com |
| |||
| Oops, I just did a copy/paste on the bottom part of your query... Not sure if you need the GROUP BY at all. If the intent is to summarize counts and there are multiple occurrences of the same set of keywords, then you can just add SUM for the CASE expression. Plamen Ratchev http://www.SQLStudio.com |
| |||
| > Is there a way in SQL that I can achieve this or do I need to use > some kind of scripting language to accomplish it...? > > Regards, > Hi All, Thanks for the pointers - I'd kind of figured it was too much hassle to be worth it and unfortunately I've inheritied this database which is 3+ years old and has *a lot* of content in it, so redesigning the schema would be a huge undertaking. I'm already using FTS Contains Clause to get the data out, so I'm going to have to make do with that for now... Thanks anyway, Carl. |
| ||||
| news.demon.co.uk wrote: > Thanks for the pointers - I'd kind of figured it was too much hassle to be > worth it and unfortunately I've inheritied this database which is 3+ years > old and has *a lot* of content in it, so redesigning the schema would be a > huge undertaking. The question is not how much content, but how many tables and how many things looking at those tables. (And if you can create views based on the new tables that look like the old ones, then you can avoid having to redesign the things-looking-at-them right away.) |