Tuesday, April 09, 2013

How I added 250.000+ documents and 3.5 Million pieces of metadata to 152 #SP2010 document libraries


The approach I took was the following:

First, I created a separate site collection to hold the 8000+ documents; from a content database point of view, this made the most sense.

After enabling all branding features on this site collection, I created a template document library with the appropriate content type enabled, the correct view, and the managed metadata navigation configuration.

Next, I used a tool called DocKIT from VYAPIN software to create a basic metadata Excel document from a file share which contained the documents that needed to go to SharePoint.
http://www.vyapin.com/products/sharepoint-migration/dockit/sharepoint-migration.htm

This basic .XLS contains a reference (path) for every file, and per record (file) you can add the metadata needed to fill the content type in the destination location (library). Because it's all in Excel (the tool reads from this document later), it's easy to copy and paste the metadata into the document and complete the metadata in bulk, offline.
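This kind of metadata workbook can also be bootstrapped programmatically instead of by hand. Below is a minimal Python sketch of the idea, not the author's actual tooling: the column names are illustrative assumptions, and CSV stands in for the .XLS format that DocKIT actually reads.

```python
import csv
import os

# Hypothetical column layout: the source path plus one column per
# content-type field. Real field names would match the content type.
COLUMNS = ["Path", "Company", "DocumentType", "Year"]

def build_metadata_rows(share_root):
    """Walk the file share and emit one row per document.

    Metadata columns start out empty; they are completed in bulk
    inside Excel with copy & paste, as described above.
    """
    rows = []
    for dirpath, _dirnames, filenames in os.walk(share_root):
        for name in filenames:
            rows.append({
                "Path": os.path.join(dirpath, name),
                "Company": "",
                "DocumentType": "",
                "Year": "",
            })
    return rows

def write_metadata_csv(rows, out_file):
    """Write the rows as a CSV skeleton with a header row."""
    writer = csv.DictWriter(out_file, fieldnames=COLUMNS)
    writer.writeheader()
    writer.writerows(rows)
```

From here, the file is opened in Excel, the empty columns are filled in bulk, and the result is fed to the import tool.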

Next, I created a document library per 'company-documents'; this was done mainly for security reasons. (Persons who have rights to company-A's documents don't necessarily have rights to company-B's documents. To manage these rights in bulk, separate document libraries are used.)

I started off by creating separate sites (sub-sites) per company (152 in total), using a PowerShell script to create the sub-sites, but I abandoned this idea after some sites seemed to be inaccessible (they just did not exist) for some users and for the import tool. Very strange behaviour from SP2010, but I didn't have time to do much research on why this was happening.

The metadata.xls document contained the 'destination library location' for each document (again, a copy & paste within Excel makes this definition very fast), and after completing all metadata for the 8000+ documents, the file was ready to use as import data for the DocKIT tool.
This tool reads each file from its "Path" location, uploads it to the destination library on SharePoint, attaches the right content type to the file, and applies the configured metadata, all automatically.

After completion, 8000+ documents were uploaded in 152 document libraries, mostly automated :)
Each document contains 14 pieces of metadata; in total, more than 112000 metadata additions in SharePoint 2010.

I created the document libraries automatically, based on the data in the Excel document (the company name, available as a piece of metadata, was used as the document library name), using the library template as one of the parameters.
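The list of libraries to create falls straight out of the metadata rows. A small Python sketch of that step, assuming a "Company" column as above (the actual library creation happened against SharePoint and is not shown here):

```python
def libraries_to_create(rows, company_field="Company"):
    """Return the unique company names from the metadata rows,
    in first-seen order; each one becomes a document library
    created from the library template."""
    seen = []
    for row in rows:
        name = row.get(company_field, "").strip()
        if name and name not in seen:
            seen.append(name)
    return seen
```

Each name in the returned list would then be passed, together with the template, to whatever provisioning mechanism creates the library.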

I'm happy to share more details if need be on the configuration of DocKIT or any of the other items described here :)

Update: using the same principle, I also created a "Public Domain Documents Silo" for this client, uploading 220.000 documents (mostly OCR'd PDFs), all with around 15 pieces of metadata attached. (I wrote a little C# program that got most of the existing information, like the folder structure, into the metadata.xls document.)
This site collection is fully searchable (SharePoint 2010 search) and VERY fast: results return in under 3 seconds. The nice thing is that because all documents have so much metadata, the search is fully refinable, creating a great end-user experience.
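The little program mentioned in the update mapped the existing folder structure into metadata columns. The author wrote it in C#; here is a hedged Python sketch of the same idea, where the mapping of path segments to fields is an illustrative assumption:

```python
import os

def metadata_from_path(path, share_root, fields=("Company", "Department")):
    """Derive metadata values from the folder structure: each path
    segment below the share root fills one metadata field, in order.
    Files closer to the root simply get fewer fields filled in."""
    rel = os.path.relpath(os.path.dirname(path), share_root)
    segments = [] if rel == "." else rel.split(os.sep)
    meta = {"Path": path}
    for field, segment in zip(fields, segments):
        meta[field] = segment
    return meta
```

Rows produced this way would be merged into the metadata.xls document, leaving only the remaining fields to be completed by hand in Excel.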