Parsing Product Specifications to Populate 1C-Bitrix
Specifications are the foundation of catalog filtering. Without correctly populated properties (b_iblock_element_property), the 1C-Bitrix smart filter does not work, faceted search returns zero results, and customers cannot find the products they need. Parsing specifications is technically more complex than parsing descriptions: the goal is not simply to extract text but to recognize the "parameter name — value" structure and place the data correctly into infoblock properties.
Formats of specifications on source sites
Specification tables come in several forms:
HTML table (<table>) — the classic format, parsed via XPath //table//tr. The first cell in a row is the name, the second is the value.
dl/dt/dd list — commonly used in modern stores. Parse dt+dd pairs.
JSON-LD or schema.org microdata — the ideal format. Data is already structured; no need to parse HTML:
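// Match the first JSON-LD block; tolerate extra attributes on the <script> tag
if (preg_match('/<script[^>]*type="application\/ld\+json"[^>]*>(.*?)<\/script>/is', $html, $m)) {
    $data = json_decode($m[1], true); // null if the JSON is malformed
}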
JS variables — data in window.productData or __REDUX_STATE__. Extract via regex.
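The two HTML-based formats above can be handled with PHP's built-in DOM extension. The following is a minimal sketch, not a definitive implementation: the function names extractSpecPairs and cleanText are hypothetical, and it assumes that each table row holds the name in its first cell and the value in its second, and that each dt is followed by its dd.

```php
<?php
// Hypothetical helper: extracts "name => value" pairs from <table> rows and <dl> lists.
function extractSpecPairs(string $html): array
{
    $dom = new DOMDocument();
    // Collect libxml warnings internally instead of emitting them for imperfect markup.
    $previous = libxml_use_internal_errors(true);
    // The XML prolog hints UTF-8 so non-ASCII names survive loadHTML().
    $dom->loadHTML('<?xml encoding="UTF-8">' . $html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);
    $pairs = [];

    // Table rows: the first cell is the name, the second is the value.
    foreach ($xpath->query('//table//tr') as $row) {
        $cells = $xpath->query('./th|./td', $row);
        if ($cells->length >= 2) {
            $name = cleanText($cells->item(0)->textContent);
            $pairs[$name] = cleanText($cells->item(1)->textContent);
        }
    }

    // Definition lists: each <dt> is paired with the first <dd> that follows it.
    foreach ($xpath->query('//dl/dt') as $dt) {
        $dd = $xpath->query('following-sibling::dd[1]', $dt)->item(0);
        if ($dd !== null) {
            $pairs[cleanText($dt->textContent)] = cleanText($dd->textContent);
        }
    }

    return $pairs;
}

function cleanText(string $text): string
{
    // Collapse whitespace runs (including line breaks) and trim the ends.
    return trim(preg_replace('/\s+/u', ' ', $text));
}
```

Calling extractSpecPairs() on a product page returns an associative array keyed by the raw specification name, ready to be passed through name normalization. Pages that put several specification tables or header rows in one document will need a narrower XPath than //table//tr.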
Normalizing specification names
Different sources name the same attribute differently: "Weight", "Net mass", "Weight (kg)". Direct mapping to an infoblock property without normalization creates chaos.
Solution: an alias table, parser_property_aliases:
CREATE TABLE parser_property_aliases (
    alias VARCHAR(255),
    canonical_name VARCHAR(255),
    property_code VARCHAR(100)
);
During parsing, every found name is looked up in the alias table. If not found — log it as an "unknown property" for manual review and addition to the dictionary.
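The lookup described above might look like the sketch below. It is illustrative only: normalizeAlias and resolvePropertyCode are hypothetical names, the dictionary is an in-memory array standing in for rows loaded from parser_property_aliases, and the normalization rules (lowercasing, dropping a trailing colon, collapsing whitespace) are assumptions. It also assumes the mbstring extension is available.

```php
<?php
// Hypothetical normalizer: lowercases, drops a trailing colon and collapses whitespace.
function normalizeAlias(string $name): string
{
    $name = mb_strtolower(trim($name), 'UTF-8');
    $name = rtrim($name, ': ');
    return preg_replace('/\s+/u', ' ', $name);
}

/**
 * Resolves a raw specification name to a property code.
 * $aliases maps normalized alias => property_code.
 * Unknown names are counted in $unknown for manual review.
 */
function resolvePropertyCode(string $rawName, array $aliases, array &$unknown): ?string
{
    $key = normalizeAlias($rawName);
    if (isset($aliases[$key])) {
        return $aliases[$key];
    }
    // Count occurrences so the most frequent unknown names can be reviewed first.
    $unknown[$key] = ($unknown[$key] ?? 0) + 1;
    return null;
}

// Example dictionary; in practice it would be loaded from the alias table.
$aliases = [
    'weight'      => 'WEIGHT_KG',
    'net mass'    => 'WEIGHT_KG',
    'weight (kg)' => 'WEIGHT_KG',
];
$unknown = [];
$code = resolvePropertyCode(' Net  Mass: ', $aliases, $unknown);     // resolves via the alias
$miss = resolvePropertyCode('Screen diagonal', $aliases, $unknown);  // not in the dictionary
```

Normalizing before the lookup keeps the dictionary smaller, since spelling variants that differ only in case, spacing, or a trailing colon collapse into one alias.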
Mapping to infoblock properties
Infoblock properties (b_iblock_property) have types: S (string), N (number), L (list), E (element link). For specifications, typically:
- S: text values ("Color: red")
- N: numeric values with a unit of measurement (PROPERTY_TYPE = N, USER_TYPE empty)
- L: fixed list of values (important for filter performance)
For the 1C-Bitrix smart filter (bitrix:catalog.smart.filter), L type values perform faster than S: each element stores the ID of an enum value from b_iblock_property_enum rather than a free-form string.
Creating an L type value during import:
// Get or create an enum value and keep its ID for the element's property
$propEnum = CIBlockPropertyEnum::GetList([], [
    'PROPERTY_ID' => $propId,
    'VALUE' => $parsedValue,
])->Fetch();
$enumId = $propEnum
    ? (int)$propEnum['ID']
    : (int)CIBlockPropertyEnum::Add(['PROPERTY_ID' => $propId, 'VALUE' => $parsedValue]); // 0 if Add() failed
Handling units of measurement
Sources provide "10 kg", "10kg", "10 kilograms". A unit parser is needed: split the number from the unit, normalize the unit to a standard form. Simple regex: /^([\d.,]+)\s*(.*)$/.
Numeric property values in 1C-Bitrix are stored as strings in b_iblock_element_property.VALUE, so keep the stored value a bare number: it is better to put units in a separate property or include them in the property CODE (WEIGHT_KG).
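A unit parser built around the regex above could be sketched as follows. The UNIT_ALIASES map and the parseMeasurement function are hypothetical, and the sketch assumes a comma is a decimal separator ("1,5 kg") rather than a thousands separator, which should be checked against the actual sources.

```php
<?php
// Hypothetical unit map; extend it with the spellings your sources actually use.
const UNIT_ALIASES = [
    'kg' => 'kg', 'kilogram' => 'kg', 'kilograms' => 'kg',
    'g'  => 'g',  'gram'     => 'g',  'grams'     => 'g',
];

/**
 * Splits "10 kg" style values into a float and a normalized unit.
 * Returns null when the value does not start with a valid number.
 */
function parseMeasurement(string $raw): ?array
{
    if (!preg_match('/^([\d.,]+)\s*(.*)$/u', trim($raw), $m)) {
        return null;
    }
    // Assumption: a comma is a decimal separator, not a thousands separator.
    $number = str_replace(',', '.', $m[1]);
    if (!is_numeric($number)) {
        return null; // e.g. "1.2.3" or a lone separator
    }
    // Lowercase the unit and drop a trailing dot ("kg." becomes "kg").
    $unit = strtolower(rtrim(trim($m[2]), '.'));
    return [
        'value' => (float)$number,
        'unit'  => UNIT_ALIASES[$unit] ?? $unit, // unknown units pass through unchanged
    ];
}
```

With this map, "10 kg", "10kg", and "10 kilograms" all produce the same value and unit, so the number can go into the N property while the unit is checked against the one implied by the property CODE.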
Case study: electronics, 15,000 SKUs, 120+ specification types
Goal: populate properties for the smart filter across laptops, phones, and TVs — three infoblocks with different property sets.
Implementation:
- Parsing from the manufacturer's site via JSON-LD (70% of products) + HTML table (30%)
- Alias dictionary of 380 entries, built during the first 3 days of development
- All numeric specifications: type N; list-based values (brand, color, country): type L
- 5 workers running in parallel via PHP-CLI, each handling its own category
Result: the smart filter worked correctly across 48 parameters after 2 iterations of alias dictionary debugging.
Work timeline
| Phase | Duration |
|---|---|
| Analyzing the source's specification structure | 4–8 hours |
| Developing the specification parser | 2–3 days |
| Building the alias dictionary, normalization | 1–2 days |
| Configuring infoblock properties for the filter | 1 day |
| Importing data, debugging property types | 1–2 days |
| Verifying smart filter functionality | 4–8 hours |
Total: 7–12 working days — this is one of the most labor-intensive catalog population tasks.

