field-extraction-parsing
Extract structured fields from unstructured log data using OPAL parsing functions. Covers extract_regex() for pattern matching with type casting, split() for delimited data, parse_json() for JSON logs, and JSONPath for navigating parsed structures. Use when you need to convert raw log text into queryable fields for analysis, filtering, or aggregation.
$ Install
git clone https://github.com/rustomax/observe-community-mcp /tmp/observe-community-mcp && cp -r /tmp/observe-community-mcp/skills/field-extraction-parsing ~/.claude/skills/observe-community-mcp/
# tip: Run this command in your terminal to install the skill
Field Extraction and Parsing
Overview
This skill covers converting unstructured log data into structured, queryable fields using OPAL's extraction and parsing functions.
Core Functions:
- extract_regex() - Extract fields using regex patterns with named capture groups
- split() / split_part() - Split delimited strings
- parse_json() - Parse JSON strings into objects
- JSONPath / Array access - Navigate parsed structures
When to Use This Skill
Use field extraction when you need to:
- Parse log formats - Extract timestamp, level, message from structured logs
- Extract identifiers - Pull out request IDs, trace IDs, user IDs for correlation
- Parse metrics from text - Extract numbers, durations, status codes from logs
- Structure unstructured data - Convert free-form text into queryable columns
- Parse embedded JSON - Extract fields from JSON-formatted log messages
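All of these reduce to the same move: run an extraction function over a raw text column to get new typed columns, then filter or aggregate on them. As a quick orientation, a minimal sketch assuming a hypothetical log body like user=john latency=150ms (field and column names are illustrative):
filter body ~ /user=/
extract_regex body, /user=(?P<user>\w+) latency=(?P<latency_ms::int64>\d+)ms/
statsby avg_latency:avg(latency_ms), group_by(user)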
Function 1: extract_regex()
The most powerful extraction function. Uses regular expressions with named capture groups to create new columns.
Syntax
extract_regex source_column, /(?P<field_name>pattern)/
extract_regex source_column, /(?P<field_name::typecast>pattern)/
Key Features:
- Named capture groups create new columns: (?P<column_name>pattern)
- Type casting in capture group: (?P<name::int64>pattern)
- Multiple captures in one regex
- Regex uses forward slashes: /pattern/, not quotes
Supported Type Casts
- string (default)
- int64, float64
- parse_isotime (for ISO timestamps)
- duration, duration_ms, duration_sec, duration_min, duration_hr
- parse_json
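Several casts can combine in one pattern. A minimal sketch, assuming a hypothetical log line like done in 150ms at 2025-11-16T01:58:12Z, and assuming the duration_ms cast reads the captured digits as milliseconds (names are illustrative):
extract_regex body, /done in (?P<elapsed::duration_ms>\d+)ms at (?P<finished_at::parse_isotime>[\dTZ:\-]+)/
# elapsed: duration (captured digits interpreted as milliseconds, per the cast name)
# finished_at: timestamp parsed from the ISO string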
Pattern 1: Extract Timestamp and Log Level
Use Case: Parse structured application logs
Log Format: [2025-11-16 01:58:12,204] INFO [Component] Message...
filter container = "kafka"
extract_regex body, /\[(?P<log_time>[\d\-: ,]+)\] (?P<level>\w+) /
Creates Columns:
- log_time: "2025-11-16 02:02:45,266"
- level: "INFO"
Use For: Java logs, Kafka logs, structured application logs
Pattern 2: Extract with Type Casting
Use Case: Extract numeric values as integers
Log Format: [SnapshotEmitter id=1] Message...
extract_regex body, /\[(?P<component>\w+) id=(?P<component_id::int64>\d+)\]/
Creates Columns:
- component: "SnapshotEmitter" (string)
- component_id: 1 (int64)
Key Point: ::int64 casts the extracted value to integer immediately
Pattern 3: Extract HTTP Request Details
Use Case: Parse access log patterns
Log Format: GET /api/users/123 200 15ms
extract_regex body, /(?P<method>\w+) (?P<path>[\w\/\-\.]+) (?P<status::int64>\d{3}) (?P<duration_ms::int64>\d+)ms/
Creates Columns:
- method: "GET"
- path: "/api/users/123"
- status: 200 (int64)
- duration_ms: 15 (int64)
Use For: Nginx, Apache, application access logs
Pattern 4: Extract Request ID for Correlation
Use Case: Pull out trace/request IDs for distributed tracing
Log Format: request_id=GHhaU0_7TcVSXpICZ9lh [info] GET /api
extract_regex body, /request_id=(?P<request_id>[\w\-]+)/
Creates Columns:
- request_id: "GHhaU0_7TcVSXpICZ9lh"
Then correlate:
extract_regex body, /request_id=(?P<request_id>[\w\-]+) \[info\] (?P<method>\w+) (?P<path>[\w\/]+)/
statsby count(), group_by(request_id, method, path)
Use For: Request correlation, distributed tracing, debugging user sessions
Pattern 5: Extract Key=Value Pairs
Use Case: Parse structured key=value log formats
Log Format: user=john action=login result=success duration=150ms
extract_regex body, /user=(?P<user>\w+) action=(?P<action>\w+) result=(?P<result>\w+) duration=(?P<duration_ms::int64>\d+)ms/
Creates Columns:
- user: "john"
- action: "login"
- result: "success"
- duration_ms: 150 (int64)
Use For: Audit logs, security logs, custom application logs
Pattern 6: Extract IP Addresses
Use Case: Parse network information from logs
Log Format: Connection from IP=192.168.1.100 to Destination="10.0.0.5"
extract_regex body, /IP=(?P<source_ip>[\d\.]+) to Destination="(?P<dest_ip>[\d\.]+)"/
Creates Columns:
- source_ip: "192.168.1.100"
- dest_ip: "10.0.0.5"
Use For: Network logs, firewall logs, connection tracking
Function 2: split() and split_part()
Split delimited strings into arrays or extract specific parts.
Syntax
split(string, delimiter) # Returns JSON array
split_part(string, delimiter, N) # Returns Nth part (1-based)
Key Differences:
- split() returns the entire array: ["a", "b", "c"]
- split_part() returns a single element (1-based indexing)
- Negative indices in split_part() count from the end: -1 = last
Pattern 7: Split IP Address into Octets
Use Case: Parse IP address components
extract_regex body, /IP=(?P<ip>[\d\.]+)/
make_col octets:split(ip, ".")
make_col first_octet:split_part(ip, ".", 1)
make_col last_octet:split_part(ip, ".", -1)
Results:
- ip: "95.217.183.1"
- octets: ["95", "217", "183", "1"]
- first_octet: "95"
- last_octet: "1"
Use For: Network analysis, IP classification
Pattern 8: Parse Path Components
Use Case: Extract parts of file paths or URLs
make_col path_parts:split("/var/log/app/error.log", "/")
make_col filename:split_part("/var/log/app/error.log", "/", -1)
make_col directory:split_part("/var/log/app/error.log", "/", -2)
Results:
- path_parts: ["", "var", "log", "app", "error.log"]
- filename: "error.log"
- directory: "app"
Use For: File path analysis, URL parsing
Pattern 9: Parse CSV-Like Data
Use Case: Extract fields from comma-separated values in logs
extract_regex body, /data:(?P<csv_data>[\w,]+)/
make_col fields:split(csv_data, ",")
make_col field1:split_part(csv_data, ",", 1)
make_col field2:split_part(csv_data, ",", 2)
make_col field3:split_part(csv_data, ",", 3)
Use For: Legacy systems, CSV exports in logs
Function 3: parse_json()
Parse JSON strings into queryable objects.
Syntax
parse_json(json_string)
Returns: OPAL object that can be accessed with JSONPath
Pattern 10: Parse JSON from Logs
Use Case: Extract fields from JSON-formatted log messages
Log Format: MetricsExporter {"kind": "exporter", "data_type": "metrics", "metrics": 23, "data points": 61}
filter body ~ /MetricsExporter/
extract_regex body, /MetricsExporter.(?P<json_data>\{.*\})/
make_col parsed:parse_json(json_data)
Result:
- json_data: {"kind": "exporter", "data_type": "metrics", ...} (string)
- parsed: {"kind": "exporter", "data_type": "metrics", ...} (object)
Next: Access fields using JSONPath (see below)
Function 4: JSONPath and Array Access
Navigate parsed JSON objects and arrays.
Syntax
object.field_name # Simple field
object."field with spaces" # Quoted for special chars
array[0] # Zero-based array indexing
object.nested.field # Nested access
Critical: Field names with spaces or special characters MUST be quoted
Pattern 11: Access JSON Fields
Use Case: Extract specific fields from parsed JSON
extract_regex body, /MetricsExporter.(?P<json_data>\{.*\})/
make_col parsed:parse_json(json_data)
make_col data_type:string(parsed."data_type")
make_col metrics_count:int64(parsed.metrics)
make_col data_points:int64(parsed."data points")
Results:
- data_type: "metrics" (string)
- metrics_count: 23 (int64)
- data_points: 61 (int64)
Key Points:
- Use quotes for "data_type" and "data points" (special characters or spaces)
- Type cast with int64(), string(), etc.
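Nested objects chain with additional dots, following the object.nested.field syntax above. A brief sketch, assuming a hypothetical parsed payload like {"user": {"id": 42, "display name": "Ann"}}:
make_col parsed:parse_json(json_data)
make_col user_id:int64(parsed.user.id)
make_col display_name:string(parsed.user."display name")  # quoted: name contains a space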
Pattern 12: Array Access with Split
Use Case: Access specific array elements
extract_regex body, /IP=(?P<ip>[\d\.]+)/
make_col parts:split(ip, ".")
make_col first_octet:int64(parts[0])
make_col second_octet:int64(parts[1])
make_col third_octet:int64(parts[2])
make_col fourth_octet:int64(parts[3])
Key Point: Array indexing is zero-based ([0] = first element)
Difference from split_part():
- split_part(): 1-based (first element = 1)
- Array [N]: 0-based (first element = 0)
Complete Examples
Example 1: Parse Application Errors
Goal: Extract error codes and messages from application logs
filter body ~ /ERROR/
extract_regex body, /\[(?P<log_time>[\d\-: ,]+)\] (?P<level>\w+) \[(?P<component>\w+)\] (?P<error_code>\w+): (?P<message>.*)/
statsby error_count:count(), sample:any(message), group_by(error_code, component)
sort desc(error_count)
Use Case: Error analysis, identifying most common errors
Example 2: Parse and Analyze HTTP Status Codes
Goal: Analyze HTTP response codes and response times
filter body ~ /\d{3} \d+ms/
extract_regex body, /(?P<method>\w+) (?P<path>[\w\/\-\.]+) (?P<status::int64>\d{3}) (?P<duration_ms::int64>\d+)ms/
make_col status_class:if(status >= 500, "5xx",
if(status >= 400, "4xx",
if(status >= 300, "3xx",
if(status >= 200, "2xx", "other"))))
statsby request_count:count(),
avg_duration:avg(duration_ms),
p95_duration:percentile(duration_ms, 0.95),
group_by(status_class, path)
sort desc(request_count)
Use Case: Performance analysis, identifying slow endpoints
Example 3: Correlate Requests Across Services
Goal: Track requests through multiple services using request_id
filter body ~ /request_id=/
extract_regex body, /request_id=(?P<request_id>[\w\-]+) \[info\] (?P<method>\w+) (?P<path>[\w\/]+)/
make_col service:string(resource_attributes."k8s.deployment.name")
statsby services:count_distinct(service),
total_logs:count(),
group_by(request_id)
filter services > 1
sort desc(services)
Use Case: Distributed tracing, identifying cross-service requests
Example 4: Parse JSON Metrics and Alert
Goal: Extract metrics from JSON logs and find anomalies
filter body ~ /MetricsExporter/
extract_regex body, /MetricsExporter.(?P<json_data>\{.*\})/
make_col parsed:parse_json(json_data)
make_col data_points:int64(parsed."data points")
make_col metrics_count:int64(parsed.metrics)
filter data_points > 200 or metrics_count > 50
statsby high_count:count(), avg_points:avg(data_points), group_by(metrics_count)
Use Case: Monitoring metric collection, detecting unusual activity
Example 5: Network Traffic Analysis
Goal: Analyze network connections by IP range
filter body ~ /IP=/
extract_regex body, /IP=(?P<ip>[\d\.]+)/
make_col octets:split(ip, ".")
make_col network:split_part(ip, ".", 1)
make_col is_private:if(network = "10" or network = "172" or network = "192", true, false)
statsby connection_count:count(), unique_ips:count_distinct(ip), group_by(is_private, network)
sort desc(connection_count)
Use Case: Security analysis, network traffic patterns
Decision Tree: Which Function to Use?
Need to extract data from logs?
│
├─ Data has a clear pattern (timestamp, IP, ID)
│  └─ Use extract_regex() with named captures
│
├─ Data is delimited (CSV, path, separated values)
│  ├─ Need all parts → Use split()
│  └─ Need a specific part → Use split_part()
│
├─ Data is JSON formatted
│  ├─ Extract JSON first → extract_regex()
│  ├─ Parse JSON → parse_json()
│  └─ Access fields → JSONPath (object.field)
│
└─ Data is mixed (pattern + delimited + JSON)
   └─ Combine: extract_regex() → split() → parse_json()
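For the mixed branch, the stages chain in one pipeline. A minimal sketch, assuming a hypothetical log line like event=checkout items=a,b,c payload={"total": 42} (all names are illustrative):
extract_regex body, /event=(?P<event>\w+) items=(?P<items_csv>[\w,]+) payload=(?P<json_data>\{.*\})/
make_col items:split(items_csv, ",")               # all parts as an array
make_col first_item:split_part(items_csv, ",", 1)  # single part, 1-based
make_col parsed:parse_json(json_data)              # string -> object
make_col total:int64(parsed.total)                 # JSONPath access with a cast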
Common Mistakes and Solutions
Mistake 1: Using Reserved Column Names
ERROR:
extract_regex body, /(?P<timestamp>[\d\-:]+)/
# Error: regex capture group 1 overwrites 'valid from' column "timestamp"
FIX:
extract_regex body, /(?P<log_time>[\d\-:]+)/
Reserved names: timestamp, valid_from, valid_to, _c_bucket
Mistake 2: Forgetting Timestamp in pick_col
ERROR:
extract_regex body, /(?P<field>\w+)/
pick_col field
# Error: need to pick 'valid from' column "timestamp"
FIX:
extract_regex body, /(?P<field>\w+)/
pick_col timestamp, field
Mistake 3: Wrong Regex Delimiters
ERROR:
extract_regex body, "pattern" # Quotes don't work
FIX:
extract_regex body, /pattern/ # Forward slashes required
Mistake 4: Tab Character in Regex
ERROR:
extract_regex body, /field\t(?P<value>.*)/
# Error: Unknown function 't()'
FIX:
extract_regex body, /field.(?P<value>.*)/ # Use . for any char
# OR
extract_regex body, /field[\t ](?P<value>.*)/ # Character class
Mistake 5: JSONPath Without Quotes
ERROR:
string(parsed.data points) # Syntax error (space in name)
FIX:
string(parsed."data points") # Quote field names with spaces
Mistake 6: Confusing split_part() vs Array Indexing
Remember:
- split_part() is 1-based (first element = 1)
- Array [N] is 0-based (first element = 0)
make_col parts:split("a.b.c", ".")
make_col using_split_part:split_part("a.b.c", ".", 1) # "a" (1-based)
make_col using_array:parts[0] # "a" (0-based)
Error Handling
Regex Non-Matches
When regex doesn't match a log line:
- Extracted columns are null (not an error)
- Original data is preserved
- Filter nulls if needed:
filter extracted_field != null
extract_regex body, /user=(?P<user>\w+)/
# Logs without "user=" will have user=null
filter user != null # Keep only matched logs
Invalid JSON
When parse_json() receives invalid JSON:
- Returns null (not an error)
- Check before accessing:
filter parsed != null
make_col parsed:parse_json(maybe_json)
filter parsed != null
make_col field:string(parsed.field_name)
Array Out of Bounds
When accessing array[999] on a smaller array:
- Returns null (not an error)
- No exception thrown
make_col parts:split("a.b.c", ".") # ["a","b","c"]
make_col safe:parts[0] # "a"
make_col oob:parts[999] # null (no error)
Performance Tips
1. Filter Before Extracting
Extract from relevant logs only:
# GOOD - Filter first
filter body ~ /ERROR/
extract_regex body, /ERROR: (?P<error_code>\w+)/
# BAD - Extract from all logs
extract_regex body, /ERROR: (?P<error_code>\w+)/
filter error_code != null
2. Single Regex with Multiple Captures
One regex is faster than multiple:
# GOOD - Single regex
extract_regex body, /\[(?P<time>[\d: ,]+)\] (?P<level>\w+) (?P<msg>.*)/
# BAD - Multiple regexes
extract_regex body, /\[(?P<time>.*)\]/
extract_regex body, /\] (?P<level>.*) /
extract_regex body, /(?P<msg>.*)/
3. Anchor Patterns When Possible
Anchored patterns (^, $) perform better:
# GOOD - Anchored
extract_regex body, /^(?P<timestamp>[\d\-:]+) /
# SLOWER - Unanchored (searches entire string)
extract_regex body, /(?P<timestamp>[\d\-:]+) /
4. Only Parse JSON When Needed
# GOOD - Filter, then parse
filter body ~ /\{.*"error"/
make_col parsed:parse_json(body)
# BAD - Parse everything
make_col parsed:parse_json(body)
filter parsed != null
Related Skills
- filtering-event-datasets - Text search and filtering before extraction
- aggregating-event-datasets - Aggregating extracted fields with statsby
- investigating-textual-data - Error analysis workflows using extraction
- analyzing-text-patterns - Pattern discovery to identify what to extract
Key Takeaways
- extract_regex() is most powerful - supports type casting and multiple captures
- Forward slashes required for regex: /pattern/, not "pattern"
- Type cast in capture group: (?P<name::int64>pattern) for immediate conversion
- split_part() is 1-based, array [N] is 0-based
- Quote JSONPath fields with spaces or special characters
- Reserved names: Avoid timestamp, valid_from, valid_to
- Nulls, not errors: Non-matches return null, not exceptions
- Filter before extract: Better performance
When in doubt about regex syntax or parsing functions, use learn_observe_skill("OPAL extract_regex") for official documentation.