General

Evaluating Conway

Oct 12, 2025 by Yogev Angelovici, Nahom Seyoum

Introduction


There are countless ways to describe a dataset. To compare different discovery systems, evaluators must quantify the utility of the reports they produce in addition to their accuracy. Given these constraints, the following evaluation is based on datasets that serve as a proxy for user preference: each of the three datasets selected is engineered to contain a set of distinct critical events, thoroughly annotated by domain experts.

In this evaluation, Conway identified roughly 20% of the critical events in these datasets [1], far exceeding the other systems tested and more than doubling the next-best score.

Our Datasets


Each dataset must contain realistic data (either a sample of real collected data, or data engineered to reflect realistically noisy conditions). Each dataset must also be comprehensively annotated by domain experts, covering both the full set of distinct events and the way each event manifests in the collected data. Lastly, each dataset must contain a large quantity of collected datapoints.

Due to the effort required to generate datasets with these requirements, there are few available. We made a special request to two research institutions in order to collect the following datasets:

CICIDS 2017 (Canadian Institute for Cybersecurity Intrusion Detection System)


The CICIDS 2017 dataset [2] represents one of the most comprehensive and realistic network intrusion detection benchmarks available. Created by the Canadian Institute for Cybersecurity, it contains 2.4 million network flows captured over 5 days of controlled attack scenarios against a realistic corporate network infrastructure. The dataset covers 17 distinct attack types, including:
  • Brute force attempts (FTP, SSH, Web)
  • Sophisticated exploits (Heartbleed)
  • Denial of service variants (Slowloris, Hulk, GoldenEye)
  • Infiltration attempts (Metasploit, Cool Disk)
  • Botnet deployment (ARES)
  • Reconnaissance activities (port scanning)
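
For readers who want to inspect the raw flows themselves, the sketch below shows one way the public CSV export of CICIDS 2017 might be loaded and summarized with pandas. The file layout and the `Label` column are assumptions about that particular distribution (header whitespace varies slightly between releases), and none of the evaluated systems ever saw these labels.

```python
# Minimal sketch: load CICIDS 2017 flow CSVs and summarize traffic per label.
# Assumes the public CSV export; file layout and the "Label" column are
# assumptions about that distribution, and labels were hidden from all
# systems during the evaluation itself.
from pathlib import Path

import pandas as pd

def load_flows(csv_dir: str) -> pd.DataFrame:
    frames = []
    for path in sorted(Path(csv_dir).glob("*.csv")):
        df = pd.read_csv(path, low_memory=False)
        df.columns = [c.strip() for c in df.columns]  # some headers carry stray spaces
        frames.append(df)
    return pd.concat(frames, ignore_index=True)

flows = load_flows("cicids2017/")
print(f"{len(flows):,} flows loaded")
print(flows["Label"].value_counts())  # BENIGN vs. each attack label
```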

WaDI (Water Distribution)


The WaDI dataset [3], collected in October 2017, represents 16 days of operations from a scaled-down but fully functional water distribution testbed in Singapore. This dataset contains 123 sensors and actuators monitoring a complete water treatment and distribution system, including:
  • Raw water tanks
  • Chemical dosing systems
  • Elevated reservoirs
  • Consumer distribution networks


The dataset includes 15 distinct cyber-physical attacks executed against the operational technology infrastructure. We specifically examined the data generated on the 9th of October, which contained all 15 of these attack types.

SWaT (Secure Water Treatment)


The SWaT dataset [4] comes from the same Singapore facility but focuses on the water treatment process itself rather than distribution. It contains 11 days of operations with 51 tags (sensors and actuators) across 6 stages of water treatment:
  • Raw water intake
  • Chemical dosing
  • Ultrafiltration
  • Dechlorination
  • Reverse osmosis
  • Clean water storage


The dataset includes 41 attack scenarios that target different stages of the treatment process. We specifically took a subset of this dataset collected on the 5th of July, which contained 6 unique attack patterns.
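
To give a sense of the signal buried in these tags, here is a minimal and deliberately naive sketch of a per-sensor scan: a rolling z-score over each tag that flags windows drifting far from recent behavior. The column names, window size, and threshold are illustrative assumptions, not the detection logic used by Conway or by any system in this evaluation.

```python
# Naive per-tag anomaly scan over a SWaT/WaDI-style time series.
# Column names ("Timestamp", sensor tags), window size, and threshold are
# illustrative assumptions; this is not the detection logic of any evaluated system.
import pandas as pd

def rolling_zscore_flags(df: pd.DataFrame, window: int = 600, thresh: float = 6.0) -> pd.DataFrame:
    """Flag (tag, timestamp) pairs whose value deviates strongly from a rolling baseline."""
    flags = []
    for tag in df.select_dtypes("number").columns:
        series = df[tag]
        mean = series.rolling(window, min_periods=window).mean()
        std = series.rolling(window, min_periods=window).std()
        z = (series - mean) / std.mask(std == 0)  # avoid division by zero on flat tags
        hits = z.abs() > thresh                   # NaN comparisons evaluate to False
        for ts in df.loc[hits, "Timestamp"]:
            flags.append({"tag": tag, "timestamp": ts})
    return pd.DataFrame(flags, columns=["tag", "timestamp"])

readings = pd.read_csv("swat_july05.csv", parse_dates=["Timestamp"])
candidates = rolling_zscore_flags(readings)
print(candidates.groupby("tag").size().sort_values(ascending=False).head(10))
```

A scan like this surfaces far more candidates than true events, which is one reason the rubric below scores entity identification and root cause analysis in addition to raw detection.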

Methodology

Setup

  • We evaluated each system using a standardized prompt that provided basic domain information about the dataset and asked the system to surface anything of interest to the user. No additional guidance, labels, or hints about specific attacks were provided during the run.
  • For network datasets (CICIDS), we provided: dataset size (2.4M flows), feature descriptions (source/destination IPs, ports, protocols, flow statistics), and an indication that this was a set of network logs. No information about attack periods or types was disclosed.
  • For cyber-physical datasets (WaDI/SWaT), we provided: sensor/actuator descriptions and basic knowledge that this was logs of a physical system relating to water management. No information about attack periods or types was disclosed.
  • Foundation models (Claude 4.5 Sonnet, GPT-5, Gemini 2.5 Pro) could not reason over the data directly because the datasets were too large to upload as artefacts, so each was run through a wrapper agent. For these wrappers we used the code-editing agents released by their respective labs, giving each model maximum freedom to write its own analysis scripts and tools for the datasets.
  • We used Claude Code for Sonnet, Codex for GPT-5, and Gemini CLI for Gemini 2.5 Pro.
  • Each agent was instantiated with a clean environment containing only the datasource.
  • No system had access to attack labels, ground truth, or the evaluation rubric during analysis.
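
To make "standardized prompt, clean environment" concrete, below is a hedged sketch of how a single run might be staged: a fresh working directory containing only the datasource and a prompt of the kind described above. The directory layout, prompt wording, and the omitted agent invocation are illustrative placeholders rather than the exact harness used in this evaluation.

```python
# Illustrative staging sketch: a clean working directory holding only the
# dataset and a standardized prompt. Paths, prompt wording, and the omitted
# agent launch are hypothetical placeholders, not the exact evaluation harness.
import shutil
import tempfile
from pathlib import Path

PROMPTS = {
    "network": (
        "You are given network flow logs (~2.4M flows) with source/destination IPs, "
        "ports, protocols, and flow statistics. Analyze the data and surface "
        "anything of interest to the user."
    ),
    "cyber_physical": (
        "You are given logs from a physical water-management system, with one column "
        "per sensor or actuator. Analyze the data and surface anything of interest "
        "to the user."
    ),
}

def stage_run(dataset_path: str, kind: str) -> Path:
    """Copy only the datasource into a fresh directory; no labels or rubric are included."""
    workdir = Path(tempfile.mkdtemp(prefix="eval_run_"))
    shutil.copy(dataset_path, workdir / Path(dataset_path).name)
    (workdir / "PROMPT.txt").write_text(PROMPTS[kind])
    return workdir

workdir = stage_run("swat_july05.csv", "cyber_physical")
print(f"Agent working directory prepared at {workdir}")
# The code-editing agent (Claude Code, Codex, or Gemini CLI) would be launched
# here against `workdir`; the exact CLI invocation is intentionally omitted.
```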

Evaluation

  • Once each system completed its analysis, we collected every artefact it produced, including generated written reports and direct responses to the user.
  • We evaluated each report against our 3-point rubric per attack, checking for: (1) Statistical artifact detection, (2) Entity identification, and (3) Root cause analysis.
  • For statistical artifacts, we accepted any valid detection that identified a pattern signature within ±2 minutes of the ground-truth attack window.
  • For entity identification, we required pointing out exact matching entities involved in an incident. For network attacks, this meant correctly tracing through NAT translations. For cyber-physical attacks, this meant identifying specific sensors/actuators.
  • For root cause analysis, we accepted semantically equivalent descriptions of the cause and mechanism behind the event. For example, "credential stuffing," "brute force," and "dictionary attack" were all acceptable for FTP-Patator.
  • We calculated scores as: (Points earned / Maximum possible points) × 100%, where maximum points = (Number of events) × 3.
  • See the specific rubrics and criteria for each dataset here: https://github.com/conway-engineering/eval-rubrics
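
For concreteness, the sketch below spells out the scoring arithmetic: three rubric points per event and the ±2 minute tolerance on detected windows. The data structures are illustrative placeholders; the real per-dataset criteria are in the rubric repository linked above.

```python
# Scoring sketch: 3 rubric points per event (artifact, entity, root cause),
# with a ±2 minute tolerance on the detected window. Data structures are
# illustrative; see the linked rubric repository for the actual criteria.
from dataclasses import dataclass
from datetime import datetime, timedelta

TOLERANCE = timedelta(minutes=2)

def within_tolerance(detected_start: datetime, true_start: datetime) -> bool:
    """Timing check used for the statistical-artifact point."""
    return abs(detected_start - true_start) <= TOLERANCE

@dataclass
class EventScore:
    artifact: bool    # statistical signature found within tolerance
    entity: bool      # exact entities (IPs / sensors / actuators) identified
    root_cause: bool  # semantically equivalent cause and mechanism given

def total_score(events: list[EventScore]) -> float:
    """(Points earned / maximum possible points) x 100, with 3 points per event."""
    earned = sum(e.artifact + e.entity + e.root_cause for e in events)
    return 100.0 * earned / (3 * len(events))

# Example: full credit on one event, only the artifact point on another.
print(f"{total_score([EventScore(True, True, True), EventScore(True, False, False)]):.1f}%")  # 66.7%
```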

Results


Total Points by System

System   Points   Score (% of max points)
Conway   25       21.9%
Cursor   11       9.6%
Devin    11       9.6%
Claude   10       8.8%
Codex    9        7.9%
Gemini   0        0%

Analysis


We found that most of the difference between Conway and competitors lay in discovering unexpected data patterns.

We noted that the performance of Conway and its competitors in the CICIDS scenario was separated by a narrow margin (4 points of the 51 available, roughly 8%). This seemed to be because both Claude and GPT-5 have strong priors about the signs that indicate a cyberattack: they excelled at catching attacks with widely known, obvious signatures such as DDoS and port scanning.

However, in both the WaDI and SWaT scenarios, Conway led by double digits (11% and 15% better than the next-best model on each dataset, respectively). The other agents seemed to struggle to understand what kinds of patterns to look for and performed poorly in both scenarios.

This supports our view that Conway's separation of pattern detection from pattern interpretation is the right approach: Conway surfaced patterns that the other agents missed in both water-system scenarios and scored significantly higher as a result.

Future Directions


There are several areas for improvement. The two most salient in these scenarios were failures in pattern detection and in reasoning over causal relationships. We trace both shortcomings to the same root cause: the system needs more, and higher-quality, context before it can reach definitive conclusions.

Most of the missed points came from scenarios with subtle signatures that were never flagged. In particular, there was a class of patterns with unremarkable statistical significance but high contextual significance. Pushing our detection mechanisms to integrate domain context and learn to prioritize weak but contextually significant patterns is a key place to improve.

The other missed points, in scenarios that were correctly flagged, involved uncertainty about causal relationships within the data. Although Conway correctly flagged the statistical signature and the entities involved, it had neither direct evidence nor prior knowledge of the direction of causation, and so failed to earn the third rubric point (root cause analysis). Establishing that direction requires more knowledge of process and dataset topology, which in the future can be supplied up front or recovered through more rigorous statistical investigation at test time.
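
One concrete form such a test-time investigation could take (a hypothetical illustration, not a description of Conway's current mechanism) is a lagged cross-correlation check between flagged tags: if changes in one signal consistently precede changes in another, that ordering is weak evidence for the direction of causation. A minimal sketch on synthetic data:

```python
# Sketch of a test-time check on causal direction between two flagged tags:
# which shift of signal `a` best explains signal `b`? A positive best lag
# (a leads b) is weak evidence that a drives b. Data here is synthetic.
import numpy as np

def best_lag(a: np.ndarray, b: np.ndarray, max_lag: int = 120) -> tuple[int, float]:
    """Return the lag of `a` (in samples) that maximizes |corr(a shifted, b)|."""
    best = (0, 0.0)
    for lag in range(-max_lag, max_lag + 1):
        if lag >= 0:
            x, y = a[: len(a) - lag], b[lag:]   # a[t] paired with b[t + lag]
        else:
            x, y = a[-lag:], b[: len(b) + lag]  # b[t] paired with a[t - lag]
        if len(x) < 2 or np.std(x) == 0 or np.std(y) == 0:
            continue
        r = float(np.corrcoef(x, y)[0, 1])
        if abs(r) > abs(best[1]):
            best = (lag, r)
    return best

# Synthetic illustration: the flow reading responds ~30 samples after the valve command.
rng = np.random.default_rng(0)
valve_cmd = rng.normal(size=2000)
downstream_flow = np.roll(valve_cmd, 30) + 0.1 * rng.normal(size=2000)

lag, r = best_lag(valve_cmd, downstream_flow)
print(f"best lag = {lag} samples, corr = {r:.2f}")  # expect lag near 30: valve_cmd leads
```

A simple check like this cannot prove causation, but it can raise confidence in a proposed mechanism when paired with the process knowledge described above.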

Disclaimer


This benchmark is not a perfect measurement of our system's performance. Each of the test datasets contains a small set of salient patterns, each captured across multiple variables, and there is often more nuance to pattern discovery and analysis in real deployments. Factors like the quality and comprehensiveness of the collected data have a large impact on system performance.

To better understand our system's performance in such scenarios, we maintain an internally curated set of annotated datasets, and we plan to share further results with the general public.

Citations

[1] Conway Research Team. (2024). Pattern Discovery Evaluation Results. Internal Research Report.

[2] Sharafaldin, I., Habibi Lashkari, A., & Ghorbani, A. A. (2018). Toward Generating a New Intrusion Detection Dataset and Intrusion Traffic Characterization. In Proceedings of the 4th International Conference on Information Systems Security and Privacy (ICISSP), 108–116. SciTePress. https://doi.org/10.5220/0006639801080116

[3] Ahmed, C. M., Palleti, V. R., & Mathur, A. P. (2017). WADI: A Water Distribution Testbed for Research in the Design of Secure Cyber Physical Systems. In Proceedings of the 3rd International Workshop on Cyber-Physical Systems for Smart Water Networks (CySWATER '17), 25–28. ACM. https://doi.org/10.1145/3055366.3055375

[4] Goh, J., Adepu, S., Junejo, K. N., & Mathur, A. (2017). A Dataset to Support Research in the Design of Secure Water Treatment Systems. In Critical Information Infrastructures Security (CRITIS 2016), LNCS 10242, 88–99. Springer. https://doi.org/10.1007/978-3-319-71368-7_8