Presentation Transcript: Storing and Querying Documents with SQL Server 2012

Presentation Transcript: Storing and Querying Documents with SQL Server 2012

SQL11UPD03-TSCRIPT-05

This wiki is a transcript of a previously recorded video.

Related content assets:


Storing and Querying Documents with SQL Server 2012

Hi I’m Greg Low. I’m a SQL Server MVP and part of the Microsoft Regional Director Program. In this session I just want to spend a few minutes looking at storing and querying documents using SQL Server 2012.

 

First up, we’ll take a quick wander through I think general directions this is heading. I want to then specifically look at FileTable, a number of changes around Full Text indexing, and an additional option on the near operator where we can have customized proximity. Finally, a really interesting additional option in the product that allows us to do statistical semantic search.

 

First up, if you look at where we’re at in terms of document storage, the SQL Server team has a goal of being the preferred choice for managing the storage of data. It doesn’t matter whether it’s structured data or unstructured data. In fact, it’s often described as somewhere around 80 percent – although they say around 80 percent of statistics are made up – but if we say 80 percent of all the data in an organization tends to be data that isn’t stored in a database today, now, most of that data isn’t structured data, so we need to do a better job of storing things that aren’t just numbers and strings and common sorts of data types. In SQL Server 2008 there were reasonably good down payments made in this direction. We had better ways of storing blob data. FILESTREAM was introduced, which gave us another alternative to how we might store big lumps of data.

 

There has always been a discussion in the industry; if I have something like photos or any sort of image data and any sort of large objects, there’s been a discussion that says do I store it in the database, or do I store it in the file system and stick pointers in the database to the file system? There are problems in both cases. If I store the data out in the file system and I have pointers to it the problem with that is it’s really hard to maintain the transactional consistency. It’s easy to end up with an object and no pointer, or to end up with a pointer and no object. Also the way you get to that data has a completely different security model to the rest of the data that’s stored in the database. What FILESTREAM provided was the ability to have a sort of middle ground where you could access things via the database. SQL Server maintains the security but it also maintains the transactional consistency on the data. If we do need to do something like streaming the data, we don’t want to do that through T-SQL and TDS because that was never designed for things like streaming data. We want to do that directly from Win32 API calls. FILESTREAM allowed us to do that and it was a very good down payment in this direction.

 

In SQL Server 2012 the aim is to go to much, much higher scale so to be able to deal with very large numbers of documents. They’ve also tried to make it easier to work with documents in a folder. For example, if I just read and write documents and want to drop them in a folder having them automatically become contents of a file or a table inside SQL Server, then we want to be able to index those for rapid retrieval, and finally we want to be able to derive some sort of meaning out of them and not just an index on the data.

 

FileTable is the new object that allows us to do this. If we look in Object Explorer, you’ll find under tables that there is now FileTables. This is a special type of table created by the user but you don’t build the columns. It has a fixed schema of columns. Notice down at the bottom here I’ve got a StreamID, and a file stream and a name and a path and there’s a file type and so on. All of the things that you’d typically expect to see as attributes of a file sitting inside of a folder. The idea is that if I say create table, in this case external files as a FileTable, but then instead of listing the columns I say here is the folder that this is going to be involved with and this is the collation that I want to use. In this case I’ve specified the database default. This is taking advantage of the down payment that was made in FILESTREAM and expanding it out further.  What you end up with is a folder sitting inside Windows that looks to Windows applications just like it’s a normal folder, but you can also get to that same folder and do things like indexing and so on directly from inside SQL Server.

 

Full Text I thought was again one of my favorite things in earlier versions of the product. SQL Server 2008 was a significant turning point for Full Text indexing in SQL Server. I find that most rooms I talk to there are probably 85 percent of people have never touched Full Text, 10 percent have some old pain associated with really early versions when they first tried it and then there are around five percent of people going, yes, it’s just the most amazing thing, I love it, tell me what’s happening. I really would love to see those sort of numbers turn around because Full Text allows you to build the sort of applications that users want, rather than what IT people like.

 

I find that users love everything soft and fuzzy and they want to be able to use their own words and so on. IT people love everything nice and neat and precise, but that’s just not what users want. If you look at the evolution of common programs, I look at something like Bing Maps. If you remember to what those sort of mapping programs looked like when they were first introduced, most people can’t remember back to what they looked like. What they had on the screens was they had all sorts of separate little boxes, like what street number, what street, what suburb, what city and so on. They have all evolved now to a point where they have a single text box that says what are you looking for?

 

It’s interesting, for most business applications today, we still build them like what those things used to look like. Lots of separate little boxes that make the user conform with the application. But the users don’t want to do that. Just ask yourself, how many of your business applications can users just go in and type in a bit of text and you can give them a list of what it might be they’re after? Then even if you do provide that, it’s also, how clever is that? If I’m a user and I say the word “drive” and the data has “driving” or “drove”, would that actually return the data? To a user they’re the same concept, even though normally in SQL these are completely separate concepts. That’s an understanding that Full Text indexing has.

 

In addition, things like a thesaurus. Maybe we have a merger of two parts of a company. All of the documents from this part of the company, or the old part of the company call them clients, the other company we’ve joined together with all, say, customers. But they all mean the same thing. So if someone searches for client, does it automatically search for customers? But even more, if they just want to type in a piece of text, like find me something about, here’s a sentence. Do you have systems to have something be returned that’s sensible when people make that sort of request? If you haven’t then you need to look at Full Text indexing. It’s a significant part of the product. There was support for a wide variety of languages in 2008. One of the biggest ones that I heard people say where is it, was Greek. So in this version both Greek and Czech have been added and they also have updated word breakers and stemmers for many languages.

 

Another big one is that while we could index the contents of a document using something like FILESTREAM before, there were aspects of the document such as the properties or the metadata – who was the author, when was the thing last modified and so on – these were properties we had no way of searching. We could search the contents of the documents, but not the properties. In this version we have the ability to do property searching as well.

 

Another one is the concept of Near. Now if you haven’t worked with Full Text this might not be immediately obvious what this is about, but imagine I’m searching for two words. Most people would have the experience that if I go into a search engine and I put two words in, the closer the words are together in the results that come back, usually the more meaningful or likely to be of interest the results are. Everybody has had the experience of going into a search engine, search for two words, and you find the two words at completely different parts of the same document. Often that may not as well be a match at all. Often a concept of Near, which is interesting because it’s a fuzzy sort of concept, SQL Server Full Text indexing understands that. It knows things like words that are in a different sentence are a little bit further apart, and words that are in a different paragraph are a bit further apart as well. There was a fixed version of Near in SQL Server 2008, and that’s deprecated now. What we have now is a customizable version of this proximity. What we can do – the examples I’ve got here – I’ve said "where contains" in the description, I’m after the word "David" near the word "Galvin". I want it within two words and the "true" at the end says that the order of the words matters. So if Galvin came before David that wouldn’t match. Now in the second example I’ve said I’m after the word “kite” somewhere within 30 words of the word “flying” and I don’t care what order they’re in.

 

Full Text indexing has been a really interesting concept for allowing us to drill in and find things in documents way more capable than using a like with a term percent term to do searches. Because if I used a sub string it doesn’t give you what you want. If I search the word “pen”, I get “pencil”, I get “open”, I get “penitentiary”. I get everything that has nothing to do with what I’m looking for. So searching for words is really important, because I also don’t then have to search the entire set of documents. I can just look up an index and find those words. So it’s been fast to do that, but often what we want is more than that. We want to be able to index and work on words but often we’re looking for meaning beyond that. An amazing addition to the product in SQL Server 2012 is the ability to pick up semantics or meaning from the data as well. Over the top of this we now have a semantic key phrase table.

 

Now many people have seen things like a tag cloud before and what this is looking for are phrases or numbers of words together perhaps that are statistically significant in how often they occur inside the document. So I’m able to find documents that are about some topic in a much more interesting way. An additional one that is particularly powerful here, once I find the document of interest, I also have a statistic similarity table that says go and find me other documents based upon the key phrases found in this document that are statistically significant or similar to this document. So I can find something of interest and find all the things that matter in it, then go off and find other documents or other rows in the table that are of significance as well in a related way. With the combination of FileTable and Full Text indexing with statistical semantic search, what I can now do is take entire folders of documents, index them where I could use them for Full Text search, and then go off and do semantic key phrase searching over the top of them as well.

 

In an upcoming demonstration I’ll show you how we get on accessing documents using this. If you have not looked at Full Text indexing, I can only encourage you to take another look at it. I would only encourage you to take another look at it if you’ve looked at it earlier on and decided it wasn’t for you. If that was some time back have another look. Otherwise I’d suggest go and have a look at it if you haven’t tackled it.

Return to SQL Server 2012 Developer Training Kit BOM (en-US)


Leave a Comment
  • Please add 8 and 1 and type the answer here:
  • Post
Wiki - Revision Comment List(Revision Comment)
Sort by: Published Date | Most Recent | Most Useful
Comments
  • Patris_70 edited Revision 2. Comment: deleted (en-US) title

Page 1 of 1 (1 items)
Wikis - Comment List
Sort by: Published Date | Most Recent | Most Useful
Posting comments is temporarily disabled until 10:00am PST on Saturday, December 14th. Thank you for your patience.
Comments
  • Patris_70 edited Revision 2. Comment: deleted (en-US) title

Page 1 of 1 (1 items)