SharePoint Search Service: Regex Crawl Rules for File Types


Task: Configure the SharePoint Search Service so that it only shows content of the following types: documents (.doc, .docx, .pdf), spreadsheets (.xls, .xlsx), and ASPX pages (.aspx).

Instructions:

  1. First, compose a regex that solves your task. In my case, the regex is:
  http://<host>/.*(.aspx|.doc(x)?|.xls(x)?|.pdf)

I recommend using the very helpful site https://regex101.com/ for composing and testing your regular expressions.
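
As a rough offline check alongside regex101, the pattern can also be exercised in a few lines of Python. This is only a sketch under assumptions: `sharepoint.example.com` is a made-up stand-in for `<host>`, the dots are escaped because Python's `re` treats an unescaped `.` as "any character", and whole-URL matching via `fullmatch` is a guess at how the rule is applied. SharePoint's crawl-rule syntax may differ from Python's, so treat the result as a sanity check, not as proof of what the crawler will do.

```python
import re

# Hypothetical host; replace with your own <host>.
HOST = "sharepoint.example.com"

# Escaped variant of the regex from step 1 (assumption: dots are meant literally).
INCLUDE_PATTERN = re.compile(
    r"http://" + re.escape(HOST) + r"/.*(\.aspx|\.doc(x)?|\.xls(x)?|\.pdf)"
)


def is_included(url: str) -> bool:
    """Return True if the whole URL matches the include pattern."""
    return INCLUDE_PATTERN.fullmatch(url) is not None


if __name__ == "__main__":
    samples = [
        "http://sharepoint.example.com/sites/hr/policy.docx",
        "http://sharepoint.example.com/pages/home.aspx",
        "http://sharepoint.example.com/images/logo.png",
    ]
    for url in samples:
        print(url, "->", "include" if is_included(url) else "no match")
```

With whole-URL matching, a URL that carries a query string after the extension (for example `home.aspx?id=1`) does not match this pattern, which is worth keeping in mind when testing links.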

  2. Copy your regex and navigate to SharePoint Admin Center > Services, find the search service, and go to Manage > Crawl Rules. Add a new crawl rule of the include type (ATTENTION: remove the backslashes from the regex you got in step 1). Also enable the "Follow complex URLs" checkbox!
  3. Click Save. On the page where you added the new rule, you can also test some links to see whether a given page is covered by the rule.
  4. You also have to add a global exclude rule for each content source, with lower priority than the include rules (ATTENTION: add both the include and the exclude rules for every content source). In my case, the exclude rule regex is:
  https://<host>/.*
  5. Save all rules and run a full crawl. Check the crawled pages in the crawl log.
  6. PROFIT!
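
To see why the exclude rule has to sit below the include rule, the ordering can be simulated in Python. This is a simplified model, not SharePoint's actual implementation: it assumes rules are checked top to bottom and the first matching rule decides, that a URL matching no rule is crawled by default, and that both rules use `https` with the made-up host `sharepoint.example.com`. The names `CrawlRule` and `should_crawl` are hypothetical helpers for this sketch.

```python
import re
from typing import List, NamedTuple


class CrawlRule(NamedTuple):
    pattern: "re.Pattern[str]"
    include: bool  # True = include rule, False = exclude rule


# Hypothetical host; dots are escaped for Python's regex dialect.
BASE = r"https://sharepoint\.example\.com/"

RULES: List[CrawlRule] = [
    # Higher priority: include the wanted file types.
    CrawlRule(re.compile(BASE + r".*(\.aspx|\.doc(x)?|\.xls(x)?|\.pdf)"), True),
    # Lower priority: exclude everything else on the host.
    CrawlRule(re.compile(BASE + r".*"), False),
]


def should_crawl(url: str, rules: List[CrawlRule] = RULES, default: bool = True) -> bool:
    """First matching rule wins (assumed); unmatched URLs fall back to `default`."""
    for rule in rules:
        if rule.pattern.fullmatch(url):
            return rule.include
    return default


if __name__ == "__main__":
    for url in (
        "https://sharepoint.example.com/sites/hr/policy.pdf",
        "https://sharepoint.example.com/images/logo.png",
    ):
        print(url, "->", "crawl" if should_crawl(url) else "skip")
```

In this model, swapping the two rules makes the catch-all exclude match first, so nothing on the host would be crawled. That is the situation the priority note in step 4 is meant to avoid.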