field-extraction-parsing
Extract structured fields from unstructured log data using OPAL parsing functions. Covers extract_regex() for pattern matching with type casting, split() for delimited data, parse_json() for JSON logs, and JSONPath for navigating parsed structures. Use when you need to convert raw log text into queryable fields for analysis, filtering, or aggregation.
$ Install
git clone https://github.com/rustomax/observe-community-mcp /tmp/observe-community-mcp && cp -r /tmp/observe-community-mcp/skills/field-extraction-parsing ~/.claude/skills/observe-community-mcp/
# tip: Run this command in your terminal to install the skill
Field Extraction and Parsing
Overview
This skill covers converting unstructured log data into structured, queryable fields using OPAL's extraction and parsing functions.
Core Functions:
- extract_regex() - Extract fields using regex patterns with named capture groups
- split() / split_part() - Split delimited strings
- parse_json() - Parse JSON strings into objects
- JSONPath / Array access - Navigate parsed structures
When to Use This Skill
Use field extraction when you need to:
- Parse log formats - Extract timestamp, level, message from structured logs
- Extract identifiers - Pull out request IDs, trace IDs, user IDs for correlation
- Parse metrics from text - Extract numbers, durations, status codes from logs
- Structure unstructured data - Convert free-form text into queryable columns
- Parse embedded JSON - Extract fields from JSON-formatted log messages
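All of these reduce to the same move: run an extraction function over a raw text column to get new typed columns, then filter or aggregate on them. As a quick orientation, a minimal sketch assuming a hypothetical log body like user=john latency=150ms (field and column names are illustrative):
filter body ~ /user=/
extract_regex body, /user=(?P<user>\w+) latency=(?P<latency_ms::int64>\d+)ms/
statsby avg_latency:avg(latency_ms), group_by(user)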
Function 1: extract_regex()
The most powerful extraction function. Uses regular expressions with named capture groups to create new columns.
Syntax
extract_regex source_column, /(?P<field_name>pattern)/
extract_regex source_column, /(?P<field_name::typecast>pattern)/
Key Features:
- Named capture groups create new columns: (?P<column_name>pattern)
- Type casting in capture group: (?P<name::int64>pattern)
- Multiple captures in one regex
- Regex uses forward slashes: /pattern/, not quotes
Supported Type Casts
- string (default)
- int64, float64
- parse_isotime (for ISO timestamps)
- duration, duration_ms, duration_sec, duration_min, duration_hr
- parse_json
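Several casts can combine in one pattern. A minimal sketch, assuming a hypothetical log line like done in 150ms at 2025-11-16T01:58:12Z, and assuming the duration_ms cast reads the captured digits as milliseconds (names are illustrative):
extract_regex body, /done in (?P<elapsed::duration_ms>\d+)ms at (?P<finished_at::parse_isotime>[\dTZ:\-]+)/
# elapsed: duration (captured digits interpreted as milliseconds, per the cast name)
# finished_at: timestamp parsed from the ISO string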
Pattern 1: Extract Timestamp and Log Level
Use Case: Parse structured application logs
Log Format: [2025-11-16 01:58:12,204] INFO [Component] Message...
filter container = "kafka"
extract_regex body, /\[(?P<log_time>[\d\-: ,]+)\] (?P<level>\w+) /
Creates Columns:
- log_time: "2025-11-16 02:02:45,266"
- level: "INFO"
Use For: Java logs, Kafka logs, structured application logs
Pattern 2: Extract with Type Casting
Use Case: Extract numeric values as integers
Log Format: [SnapshotEmitter id=1] Message...
extract_regex body, /\[(?P<component>\w+) id=(?P<component_id::int64>\d+)\]/
Creates Columns:
- component: "SnapshotEmitter" (string)
- component_id: 1 (int64)
Key Point: ::int64 casts the extracted value to integer immediately
Pattern 3: Extract HTTP Request Details
Use Case: Parse access log patterns
Log Format: GET /api/users/123 200 15ms
extract_regex body, /(?P<method>\w+) (?P<path>[\w\/\-\.]+) (?P<status::int64>\d{3}) (?P<duration_ms::int64>\d+)ms/
Creates Columns:
- method: "GET"
- path: "/api/users/123"
- status: 200 (int64)
- duration_ms: 15 (int64)
Use For: Nginx, Apache, application access logs
Pattern 4: Extract Request ID for Correlation
Use Case: Pull out trace/request IDs for distributed tracing
Log Format: request_id=GHhaU0_7TcVSXpICZ9lh [info] GET /api
extract_regex body, /request_id=(?P<request_id>[\w\-]+)/
Creates Columns:
- request_id: "GHhaU0_7TcVSXpICZ9lh"
Then correlate:
extract_regex body, /request_id=(?P<request_id>[\w\-]+) \[info\] (?P<method>\w+) (?P<path>[\w\/]+)/
statsby count(), group_by(request_id, method, path)
Use For: Request correlation, distributed tracing, debugging user sessions
Pattern 5: Extract Key=Value Pairs
Use Case: Parse structured key=value log formats
Log Format: user=john action=login result=success duration=150ms
extract_regex body, /user=(?P<user>\w+) action=(?P<action>\w+) result=(?P<result>\w+) duration=(?P<duration_ms::int64>\d+)ms/
Creates Columns:
- user: "john"
- action: "login"
- result: "success"
- duration_ms: 150 (int64)
Use For: Audit logs, security logs, custom application logs
Pattern 6: Extract IP Addresses
Use Case: Parse network information from logs
Log Format: Connection from IP=192.168.1.100 to Destination="10.0.0.5"
extract_regex body, /IP=(?P<source_ip>[\d\.]+) to Destination="(?P<dest_ip>[\d\.]+)"/
Creates Columns:
- source_ip: "192.168.1.100"
- dest_ip: "10.0.0.5"
Use For: Network logs, firewall logs, connection tracking
Function 2: split() and split_part()
Split delimited strings into arrays or extract specific parts.
Syntax
split(string, delimiter) # Returns JSON array
split_part(string, delimiter, N) # Returns Nth part (1-based)
Key Differences:
- split() returns the entire array: ["a", "b", "c"]
- split_part() returns a single element (1-based indexing)
- Negative indices in split_part() count from the end: -1 = last
Pattern 7: Split IP Address into Octets
Use Case: Parse IP address components
extract_regex body, /IP=(?P<ip>[\d\.]+)/
make_col octets:split(ip, ".")
make_col first_octet:split_part(ip, ".", 1)
make_col last_octet:split_part(ip, ".", -1)
Results:
- ip: "95.217.183.1"
- octets: ["95", "217", "183", "1"]
- first_octet: "95"
- last_octet: "1"
Use For: Network analysis, IP classification
Pattern 8: Parse Path Components
Use Case: Extract parts of file paths or URLs
make_col path_parts:split("/var/log/app/error.log", "/")
make_col filename:split_part("/var/log/app/error.log", "/", -1)
make_col directory:split_part("/var/log/app/error.log", "/", -2)
Results:
- path_parts: ["", "var", "log", "app", "error.log"]
- filename: "error.log"
- directory: "app"
Use For: File path analysis, URL parsing
Pattern 9: Parse CSV-Like Data
Use Case: Extract fields from comma-separated values in logs
extract_regex body, /data:(?P<csv_data>[\w,]+)/
make_col fields:split(csv_data, ",")
make_col field1:split_part(csv_data, ",", 1)
make_col field2:split_part(csv_data, ",", 2)
make_col field3:split_part(csv_data, ",", 3)
Use For: Legacy systems, CSV exports in logs
Function 3: parse_json()
Parse JSON strings into queryable objects.
Syntax
parse_json(json_string)
Returns: OPAL object that can be accessed with JSONPath
Pattern 10: Parse JSON from Logs
Use Case: Extract fields from JSON-formatted log messages
Log Format: MetricsExporter {"kind": "exporter", "data_type": "metrics", "metrics": 23, "data points": 61}
filter body ~ /MetricsExporter/
extract_regex body, /MetricsExporter.(?P<json_data>\{.*\})/
make_col parsed:parse_json(json_data)
Result:
- json_data: {"kind": "exporter", "data_type": "metrics", ...} (string)
- parsed: {"kind": "exporter", "data_type": "metrics", ...} (object)
Next: Access fields using JSONPath (see below)
Function 4: JSONPath and Array Access
Navigate parsed JSON objects and arrays.
Syntax
object.field_name # Simple field
object."field with spaces" # Quoted for special chars
array[0] # Zero-based array indexing
object.nested.field # Nested access
Critical: Field names with spaces or special characters MUST be quoted
Pattern 11: Access JSON Fields
Use Case: Extract specific fields from parsed JSON
extract_regex body, /MetricsExporter.(?P<json_data>\{.*\})/
make_col parsed:parse_json(json_data)
make_col data_type:string(parsed."data_type")
make_col metrics_count:int64(parsed.metrics)
make_col data_points:int64(parsed."data points")
Results:
- data_type: "metrics" (string)
- metrics_count: 23 (int64)
- data_points: 61 (int64)
Key Points:
- Use quotes for "data_type" and "data points" (special characters or spaces)
- Type cast with int64(), string(), etc.
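Nested objects chain with additional dots, following the object.nested.field syntax above. A brief sketch, assuming a hypothetical parsed payload like {"user": {"id": 42, "display name": "Ann"}}:
make_col parsed:parse_json(json_data)
make_col user_id:int64(parsed.user.id)
make_col display_name:string(parsed.user."display name")  # quoted: name contains a space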
Pattern 12: Array Access with Split
Use Case: Access specific array elements
extract_regex body, /IP=(?P<ip>[\d\.]+)/
make_col parts:split(ip, ".")
make_col first_octet:int64(parts[0])
make_col second_octet:int64(parts[1])
make_col third_octet:int64(parts[2])
make_col fourth_octet:int64(parts[3])
Key Point: Array indexing is zero-based ([0] = first element)
Difference from split_part():
- split_part(): 1-based (first element = 1)
- Array [N]: 0-based (first element = 0)
Complete Examples
Example 1: Parse Application Errors
Goal: Extract error codes and messages from application logs
filter body ~ /ERROR/
extract_regex body, /\[(?P<log_time>[\d\-: ,]+)\] (?P<level>\w+) \[(?P<component>\w+)\] (?P<error_code>\w+): (?P<message>.*)/
statsby error_count:count(), sample:any(message), group_by(error_code, component)
sort desc(error_count)
Use Case: Error analysis, identifying most common errors
Example 2: Parse and Analyze HTTP Status Codes
Goal: Analyze HTTP response codes and response times
filter body ~ /\d{3} \d+ms/
extract_regex body, /(?P<method>\w+) (?P<path>[\w\/\-\.]+) (?P<status::int64>\d{3}) (?P<duration_ms::int64>\d+)ms/
make_col status_class:if(status >= 500, "5xx",
if(status >= 400, "4xx",
if(status >= 300, "3xx",
if(status >= 200, "2xx", "other"))))
statsby request_count:count(),
avg_duration:avg(duration_ms),
p95_duration:percentile(duration_ms, 0.95),
group_by(status_class, path)
sort desc(request_count)
Use Case: Performance analysis, identifying slow endpoints
Example 3: Correlate Requests Across Services
Goal: Track requests through multiple services using request_id
filter body ~ /request_id=/
extract_regex body, /request_id=(?P<request_id>[\w\-]+) \[info\] (?P<method>\w+) (?P<path>[\w\/]+)/
make_col service:string(resource_attributes."k8s.deployment.name")
statsby services:count_distinct(service),
total_logs:count(),
group_by(request_id)
filter services > 1
sort desc(services)
Use Case: Distributed tracing, identifying cross-service requests
Example 4: Parse JSON Metrics and Alert
Goal: Extract metrics from JSON logs and find anomalies
filter body ~ /MetricsExporter/
extract_regex body, /MetricsExporter.(?P<json_data>\{.*\})/
make_col parsed:parse_json(json_data)
make_col data_points:int64(parsed."data points")
make_col metrics_count:int64(parsed.metrics)
filter data_points > 200 or metrics_count > 50
statsby high_count:count(), avg_points:avg(data_points), group_by(metrics_count)
Use Case: Monitoring metric collection, detecting unusual activity
Example 5: Network Traffic Analysis
Goal: Analyze network connections by IP range
filter body ~ /IP=/
extract_regex body, /IP=(?P<ip>[\d\.]+)/
make_col octets:split(ip, ".")
make_col network:split_part(ip, ".", 1)
make_col is_private:if(network = "10" or network = "172" or network = "192", true, false)
statsby connection_count:count(), unique_ips:count_distinct(ip), group_by(is_private, network)
sort desc(connection_count)
Use Case: Security analysis, network traffic patterns
Decision Tree: Which Function to Use?
Need to extract data from logs?
│
├─ Data has a clear pattern (timestamp, IP, ID)
│  └─ Use extract_regex() with named captures
│
├─ Data is delimited (CSV, path, separated values)
│  ├─ Need all parts → Use split()
│  └─ Need a specific part → Use split_part()
│
├─ Data is JSON formatted
│  ├─ Extract JSON first → extract_regex()
│  ├─ Parse JSON → parse_json()
│  └─ Access fields → JSONPath (object.field)
│
└─ Data is mixed (pattern + delimited + JSON)
   └─ Combine: extract_regex() → split() → parse_json()
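For the mixed branch, the stages chain in one pipeline. A minimal sketch, assuming a hypothetical log line like event=checkout items=a,b,c payload={"total": 42} (all names are illustrative):
extract_regex body, /event=(?P<event>\w+) items=(?P<items_csv>[\w,]+) payload=(?P<json_data>\{.*\})/
make_col items:split(items_csv, ",")               # all parts as an array
make_col first_item:split_part(items_csv, ",", 1)  # single part, 1-based
make_col parsed:parse_json(json_data)              # string -> object
make_col total:int64(parsed.total)                 # JSONPath access with a cast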
Common Mistakes and Solutions
Mistake 1: Using Reserved Column Names
ERROR:
extract_regex body, /(?P<timestamp>[\d\-:]+)/
# Error: regex capture group 1 overwrites 'valid from' column "timestamp"
FIX:
extract_regex body, /(?P<log_time>[\d\-:]+)/
Reserved names: timestamp, valid_from, valid_to, _c_bucket
Mistake 2: Forgetting Timestamp in pick_col
ERROR:
extract_regex body, /(?P<field>\w+)/
pick_col field
# Error: need to pick 'valid from' column "timestamp"
FIX:
extract_regex body, /(?P<field>\w+)/
pick_col timestamp, field
Mistake 3: Wrong Regex Delimiters
ERROR:
extract_regex body, "pattern" # Quotes don't work
FIX:
extract_regex body, /pattern/ # Forward slashes required
Mistake 4: Tab Character in Regex
ERROR:
extract_regex body, /field\t(?P<value>.*)/
# Error: Unknown function 't()'
FIX:
extract_regex body, /field.(?P<value>.*)/ # Use . for any char
# OR
extract_regex body, /field[\t ](?P<value>.*)/ # Character class
Mistake 5: JSONPath Without Quotes
ERROR:
string(parsed.data points) # Syntax error (space in name)
FIX:
string(parsed."data points") # Quote field names with spaces
Mistake 6: Confusing split_part() vs Array Indexing
Remember:
- split_part() is 1-based (first element = 1)
- Array [N] is 0-based (first element = 0)
make_col parts:split("a.b.c", ".")
make_col using_split_part:split_part("a.b.c", ".", 1) # "a" (1-based)
make_col using_array:parts[0] # "a" (0-based)
Error Handling
Regex Non-Matches
When regex doesn't match a log line:
- Extracted columns are null (not an error)
- Original data is preserved
- Filter nulls if needed:
filter extracted_field != null
extract_regex body, /user=(?P<user>\w+)/
# Logs without "user=" will have user=null
filter user != null # Keep only matched logs
Invalid JSON
When parse_json() receives invalid JSON:
- Returns null (not an error)
- Check before accessing:
filter parsed != null
make_col parsed:parse_json(maybe_json)
filter parsed != null
make_col field:string(parsed.field_name)
Array Out of Bounds
When accessing array[999] on a smaller array:
- Returns null (not an error)
- No exception thrown
make_col parts:split("a.b.c", ".") # ["a","b","c"]
make_col safe:parts[0] # "a"
make_col oob:parts[999] # null (no error)
Performance Tips
1. Filter Before Extracting
Extract from relevant logs only:
# GOOD - Filter first
filter body ~ /ERROR/
extract_regex body, /ERROR: (?P<error_code>\w+)/
# BAD - Extract from all logs
extract_regex body, /ERROR: (?P<error_code>\w+)/
filter error_code != null
2. Single Regex with Multiple Captures
One regex is faster than multiple:
# GOOD - Single regex
extract_regex body, /\[(?P<time>[\d: ,]+)\] (?P<level>\w+) (?P<msg>.*)/
# BAD - Multiple regexes
extract_regex body, /\[(?P<time>.*)\]/
extract_regex body, /\] (?P<level>.*) /
extract_regex body, /(?P<msg>.*)/
3. Anchor Patterns When Possible
Anchored patterns (^, $) perform better:
# GOOD - Anchored
extract_regex body, /^(?P<timestamp>[\d\-:]+) /
# SLOWER - Unanchored (searches entire string)
extract_regex body, /(?P<timestamp>[\d\-:]+) /
4. Only Parse JSON When Needed
# GOOD - Filter, then parse
filter body ~ /\{.*"error"/
make_col parsed:parse_json(body)
# BAD - Parse everything
make_col parsed:parse_json(body)
filter parsed != null
Related Skills
- filtering-event-datasets - Text search and filtering before extraction
- aggregating-event-datasets - Aggregating extracted fields with statsby
- investigating-textual-data - Error analysis workflows using extraction
- analyzing-text-patterns - Pattern discovery to identify what to extract
Key Takeaways
- extract_regex() is most powerful - supports type casting and multiple captures
- Forward slashes required for regex: /pattern/, not "pattern"
- Type cast in capture group: (?P<name::int64>pattern) for immediate conversion
- split_part() is 1-based, array [N] is 0-based
- Quote JSONPath fields with spaces or special characters
- Reserved names: Avoid timestamp, valid_from, valid_to
- Nulls, not errors: Non-matches return null, not exceptions
- Filter before extract: Better performance
When in doubt about regex syntax or parsing functions, use learn_observe_skill("OPAL extract_regex") for official documentation.