Small .pdf files reporting – “The item has been truncated in the index because it exceeds the maximum size”

Recently, I encountered an issue with SharePoint 2013 search crawls where .pdf files smaller than 1 MB were reporting a warning: “The item has been truncated in the index because it exceeds the maximum size”. The default MaxDownLoadSize for documents in SharePoint is 64MB, which was more than enough the handle these relatively small .pdf files.

After I reached out to some co-workers; one had suggested that the error might be a false-positive and the entire document had been crawled. I tested this by first searching for words at the end of the document and no matches were found; this would be expected if it were truncated. Next, I tried searching for some text in the middle of the document and no matches were found either. I thought it must have truncated a lot of text and tried searching for text contained at the very beginning of the document. No results were found! So when the warning said it had truncated the item, it had truncated the whole document.

I decided to test a Microsoft Word document of approximately the same size and found that it did not throw a warning. I then exported that Word document to .pdf and crawled it and surprisingly no warning was thrown. I took a look at the properties of both .pdf files. I found that problem .pdf file was version 1.3 (Acrobat 4.x) and had been generated by Microsoft SQL Server Reporting Services (SSRS), while my test document was version 1.6 (Acrobat 6.x).

Figure 1. PDF Version 1.3 (Acrobat 4.x) generated by SQL Server Reporting Services (SSRS).

Figure 2. PDF Version 1.5 (Acrobat 6.x) generated by Microsoft Word

This got me thinking that problem was version 1.3, so I converted my .pdf to 1.3 and crawled it. I was surprised to find that it did not throw the warning as I expected it too. This would mean that either 1) the culprit was SSRS report generation (not the version 1.3), or 2) there was something about converting a version 1.5 to a version 1.3 that fixed the issue. With this in mind, I preferred to find a centralized solution rather than converting every .pdf document in our site.

I decided to try using the official Adobe IFilter in place of the one that is built-in to SharePoint 2013. In order to leverage the custom IFilter, I did the following:

Add-PSSnapin "Microsoft.SharePoint.PowerShell" -ErrorAction SilentlyContinue
$iFilterPath = "C:\Program Files\Adobe\Adobe PDF iFilter 11 for 64-bit platforms\bin\"
$oldPath=(Get-ItemProperty -Path 'Registry::HKEY_LOCAL_MACHINE\System\CurrentControlSet\Control\Session Manager\Environment' -Name PATH).Path
if(!$oldPath.Contains($ifilterPath))
{
    Write-Host "Adding Environment Path"
    $newPath="$oldPath;$iFilterPath"
    Set-ItemProperty -Path 'Registry::HKEY_LOCAL_MACHINE\System\CurrentControlSet\Control\Session Manager\Environment' -Name PATH –Value $newPath
}

Write-Host "Setting Up Registry entries"
#See: http://www.adobe.com/devnet-docs/acrobatetk/tools/AdminGuide/Acrobat_Reader_IFilter_configuration.pdf
$path = "Registry::HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Office Server\15.0\Search\Setup\Filters\.pdf"
if(Test-Path $path) {Remove-Item -Path $path -Recurse}
New-Item $path
New-ItemProperty -Path $path -Name Extension -PropertyType String -Value "pdf"
New-ItemProperty -Path $path -Name FileTypeBucket -PropertyType DWord -Value 1
New-ItemProperty -Path $path -Name MimeTypes -PropertyType String -Value "application/pdf"

$path = "Registry::HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Office Server\15.0\Search\Setup\ContentIndexCommon\Filters\Extension\.pdf"
if(Test-Path $path) {Remove-Item -Path $path -Recurse}
New-Item $path
Remove-Item -Path $path\* -Recurse
New-ItemProperty -Path $path -Name "(Default)" -PropertyType MultiString -Value "{E8978DA6-047F-4E3D-9C78-CDBE46041603}"



$ssa = Get-SPEnterpriseSearchServiceApplication
$filter = Get-SPEnterpriseSearchFileFormat -SearchApplication $ssa -Identity pdf
if(!$filter.UseIFilter)
{
    Write-Host "Enabling IFilter for SharePoint"
    Set-SPEnterpriseSearchFileFormatState -SearchApplication $ssa -Identity pdf -UseIFilter $true -Enable $true

    Write-Host "Restarting Search Service"
    $service = Get-SPServiceInstance | ? {$_.TypeName -eq "Search Host Controller Service"}
    $service | Stop-SPServiceInstance -Confirm:$false
    while(-not (($service | Where-Object {$_.Status -eq "Disabled"}).Count -eq $service.Count)){
        write-host -ForegroundColor Yellow $service.Status; sleep 5;
        $service = Get-SPServiceInstance | ? {$_.TypeName -eq "Search Host Controller Service"}
    } 
    $service | Start-SPServiceInstance
    Write-Host "Restarting IIS"
    iisreset
}

After executing the script, I performed a full crawl and was pleased to find that all of the .pdf size-related warnings were gone. I performed a search using the same criteria as before from the end, middle, and beginning of document and this time results were returned.

Conclusion

While I am not certain if the cause of this issue is SSRS document generation or legacy .pdf version 1.3 documents, I was able to solve this issue using the official IFilter from Adobe. If someone else encounters this problem, hopefully this will help.