HTML Web pages are one of the SharePoint data sources.
The SharePoint crawler retrieves metadata from Web content as crawled properties.
The default crawled properties that are available before you perform a first crawl are listed in [1].
After the first crawl, the list of crawled properties and the list of crawled property categories change to include the additional crawled properties and categories found during that crawl.
One example of custom crawled properties that SharePoint retrieves from Web pages is the content of HTML meta tags.
1 Overview
The crawled properties that the SharePoint search crawler creates for custom HTML meta tags are added to the following crawled property categories:
Web is a standard crawled property category. Crawled properties are added to this category under the name of the custom meta tag, converted to uppercase.
Document Parser is a custom crawled property category. Crawled properties are added to this category under the name of the custom meta tag, with its case preserved.
Example:
The Web page http://authors.library.caltech.edu/37214/ contains the following meta tag:
<meta name="eprints.citation" content=" DiMarco, E. Joseph and Khabiboulline, Emil and Orris, Darryl F. and Tartaglia, Michael A. and Terechkine, Iouri (2013) Superconducting Solenoid Lens for a High Energy Part of a Proton Linac Front End. IEEE Transactions on Applied Superconductivity, 23 (3). Art. No. 4100905. ISSN 1051-8223 http://resolver.caltech.edu/CaltechAUTHORS:20130228-145009767 <http://resolver.caltech.edu/CaltechAUTHORS:20130228-145009767> " />
After crawling this page, the SharePoint crawler creates the following crawled properties:
Web category:
Property name: EPRINTS.CITATION
Category: Web
Property Set ID: d1b5d3f0-c0b3-11cf-9a92-00a0c908dbf1
Document Parser category:
Property name: eprints.citation
Category: Document Parser
Property Set ID: 64ae120f-487d-445a-8d5a-5258f99cb970
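The naming rule described above can be sketched in a few lines of Python. This is only an illustration for predicting which crawled property names to look for after a crawl; it is not how the SharePoint crawler itself is implemented. The class `MetaTagCollector`, the function `expected_crawled_properties`, and the sample page are hypothetical names introduced here. The two property set IDs are the ones quoted in the example above.

```python
from html.parser import HTMLParser

# Property set IDs as quoted in the example above.
WEB_PROPSET = "d1b5d3f0-c0b3-11cf-9a92-00a0c908dbf1"
DOCUMENT_PARSER_PROPSET = "64ae120f-487d-445a-8d5a-5258f99cb970"


class MetaTagCollector(HTMLParser):
    """Collects the name attribute of every <meta name="..."> tag (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.names = []

    def handle_starttag(self, tag, attrs):
        # Attribute values keep their original case; tag names are lowercased.
        if tag == "meta":
            name = dict(attrs).get("name")
            if name:
                self.names.append(name)


def expected_crawled_properties(html):
    """Return (property name, category, property set ID) tuples for each named meta tag."""
    collector = MetaTagCollector()
    collector.feed(html)
    collector.close()
    rows = []
    for name in collector.names:
        # Web category: same name, converted to uppercase.
        rows.append((name.upper(), "Web", WEB_PROPSET))
        # Document Parser category: same name, case preserved.
        rows.append((name, "Document Parser", DOCUMENT_PARSER_PROPSET))
    return rows


# Hypothetical sample page; the content value is a placeholder.
page = '<html><head><meta name="eprints.citation" content="sample citation" /></head></html>'
for row in expected_crawled_properties(page):
    print(row)
```

Running the sketch against a page before crawling it gives a checklist of property names to search for in each category once the crawl has finished.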
2 HTML meta tags and corresponding crawled properties
Different publishers use many different meta tag namespaces.
The most common are:
Highwire Press tags (e.g., citation_title),
Eprints tags (e.g., eprints.title),
Dublin Core tags (e.g., DC.title).
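As a rough illustration, the namespace of a meta tag or crawled property can be guessed from its name prefix. The sketch below assumes that the prefixes shown in the examples above (`citation_`, `eprints.`, `DC.`) identify the namespace; the function `meta_tag_namespace` and the prefix table are hypothetical and are not part of SharePoint. The comparison ignores case so that it also works on the uppercase names in the Web category.

```python
# Prefixes taken from the examples in the list above (assumed to be distinctive).
NAMESPACE_PREFIXES = {
    "citation_": "Highwire Press",
    "eprints.": "Eprints",
    "dc.": "Dublin Core",
}


def meta_tag_namespace(name):
    """Return the namespace guessed from the name prefix, or None if unknown."""
    lowered = name.lower()
    for prefix, namespace in NAMESPACE_PREFIXES.items():
        if lowered.startswith(prefix):
            return namespace
    return None


for tag_name in ("citation_title", "EPRINTS.TITLE", "DC.title", "keywords"):
    print(tag_name, "->", meta_tag_namespace(tag_name))
```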
2.1 Crawled properties for Citation Meta tags
WEB | Document Parser | Content
CITATION_ABSTRACT | citation_abstract | abstract
CITATION_AUTHOR | citation_author | author
CITATION_AUTHORS | citation_authors | authors
CITATION_DATE | citation_date | date
CITATION_PUBLICATION_DATE | citation_publication_date |
CITATION_ONLINE_DATE | citation_online_date |
CITATION_YEAR | citation_year |
CITATION_DOI | citation_doi | doi
CITATION_FIRST_PAGE | citation_first_page | first page
CITATION_ID | citation_id |
CITATION_ISSN | citation_issn | issn
CITATION_ISBN | citation_isbn | isbn
CITATION_ISSUE | citation_issue | issue
CITATION_JOURNAL_TITLE | citation_journal_title | journal title
CITATION_LAST_PAGE | citation_last_page | last page
CITATION_PUBLISHER | citation_publisher | publisher
CITATION_TITLE | citation_title | title
CITATION_VOLUME | citation_volume | volume
2.2 Crawled properties for Eprints Meta tags
WEB | Document Parser | Content
EPRINTS.ABSTRACT | eprints.abstract |
EPRINTS.CREATORS_NAME | eprints.creators_name |
EPRINTS.DATE | eprints.date |
EPRINTS.ID_NUMBER | eprints.id_number |
EPRINTS.CITATION | eprints.citation |
EPRINTS.PAGERANGE | eprints.pagerange | pages
EPRINTS.ISSN | eprints.issn |
EPRINTS.NUMBER | eprints.number | issue
EPRINTS.PUBLICATION | eprints.publication |
EPRINTS.PUBLISHER | eprints.publisher |
EPRINTS.VOLUME | eprints.volume |
EPRINTS.OFFICIAL_URL | eprints.official_url |
EPRINTS.TITLE | eprints.title |
EPRINTS.TYPE | eprints.type | document type
2.3 Crawled properties for Dublin Core Meta tags
WEB | Document Parser | Content
DC.DESCRIPTION.ABSTRACT | DC.description.abstract |
DC.DESCRIPTION | DC.description |
DC.CREATOR.PERSONALNAME | DC.creator.personalname |
DC.CREATOR | DC.creator |
DC.CONTRIBUTOR | DC.contributor |
DC.DATE.CREATED | DC.date.created |
DC.DATE | DC.date |
DC.IDENTIFIER.DOI | DC.identifier.doi |
DC.CITATION.PAGE | DC.citation.page |
DC.IDENTIFIER.ISSN | DC.identifier.issn |
DC.SOURCE.ISSN | DC.source.issn | issn
DC.PUBLISHER | DC.publisher |
DC.CITATION.VOLUME | DC.citation.volume |
DC.IDENTIFIER | DC.identifier |
DC.RELATION | DC.relation |
DC.SOURCE | DC.source |
DC.TITLE | DC.title |
DC.TYPE | DC.type |
[1] TechNet: Crawled properties reference (SharePoint Server 2010)