General

Evaluating Conway

Oct 12, 2025 by Yogev Angelovici, Nahom Seyoum

Introduction


There are countless ways to describe a dataset. To compare different discovery systems, evaluators must quantify the utility of the reports they produce in addition to their accuracy. Given these constraints, the following evaluation is based on datasets that serve as a proxy for user preference: each of the three datasets selected is engineered to contain a set of distinct critical events, thoroughly annotated by domain experts.

In this evaluation, Conway identified roughly 20% of the critical events in these datasets [1], far exceeding the other systems tested and more than doubling the next-best score.

Our Datasets


Each dataset must contain realistic data (either a sample of real collected data, or data engineered to reflect realistically noisy conditions). Each dataset must also be comprehensively annotated by domain experts, covering both the full set of distinct events and the way each event manifests in the collected data. Lastly, each dataset must contain a large quantity of collected datapoints.

Due to the effort required to generate datasets with these requirements, there are few available. We made a special request to two research institutions in order to collect the following datasets:

CICIDS 2017 (Canadian Institute for Cybersecurity Intrusion Detection System)


The CICIDS 2017 dataset [2] represents one of the most comprehensive and realistic network intrusion detection benchmarks available. Created by the Canadian Institute for Cybersecurity, it contains 2.4 million network flows captured over 5 days of controlled attack scenarios against a realistic corporate network infrastructure. The dataset covers 17 distinct attack types, including:
  • Brute force attempts (FTP, SSH, Web)
  • Sophisticated exploits (Heartbleed)
  • Denial of service variants (Slowloris, Hulk, GoldenEye)
  • Infiltration attempts (Metasploit, Cool Disk)
  • Botnet deployment (ARES)
  • Reconnaissance activities (port scanning)
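
For readers who want to inspect the raw flows themselves, the sketch below shows one way the public CSV export of CICIDS 2017 might be loaded and summarized with pandas. The file layout and the `Label` column are assumptions about that particular distribution (header whitespace varies slightly between releases), and none of the evaluated systems ever saw these labels.

```python
# Minimal sketch: load CICIDS 2017 flow CSVs and summarize traffic per label.
# Assumes the public CSV export; file layout and the "Label" column are
# assumptions about that distribution, and labels were hidden from all
# systems during the evaluation itself.
from pathlib import Path

import pandas as pd

def load_flows(csv_dir: str) -> pd.DataFrame:
    frames = []
    for path in sorted(Path(csv_dir).glob("*.csv")):
        df = pd.read_csv(path, low_memory=False)
        df.columns = [c.strip() for c in df.columns]  # some headers carry stray spaces
        frames.append(df)
    return pd.concat(frames, ignore_index=True)

flows = load_flows("cicids2017/")
print(f"{len(flows):,} flows loaded")
print(flows["Label"].value_counts())  # BENIGN vs. each attack label
```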

WaDI (Water Distribution)


The WaDI dataset [3], collected in October 2017, represents 16 days of operations from a scaled-down but fully functional water distribution testbed in Singapore. This dataset contains 123 sensors and actuators monitoring a complete water treatment and distribution system, including:
  • Raw water tanks
  • Chemical dosing systems
  • Elevated reservoirs
  • Consumer distribution networks


The dataset includes 15 distinct cyber-physical attacks executed against the operational technology infrastructure. We specifically examined the data generated on the 9th of October, which contained all 15 of these attack types.

SWaT (Secure Water Treatment)


The SWaT dataset [4] comes from the same Singapore facility but focuses on the water treatment process itself rather than distribution. It contains 11 days of operations with 51 tags (sensors and actuators) across 6 stages of water treatment:
  • Raw water intake
  • Chemical dosing
  • Ultrafiltration
  • Dechlorination
  • Reverse osmosis
  • Clean water storage


The dataset includes 41 attack scenarios that target different stages of the treatment process. We specifically took a subset of this dataset collected on the 5th of July, which contained 6 unique attack patterns.
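
To give a sense of the signal buried in these tags, here is a minimal and deliberately naive sketch of a per-sensor scan: a rolling z-score over each tag that flags windows drifting far from recent behavior. The column names, window size, and threshold are illustrative assumptions, not the detection logic used by Conway or by any system in this evaluation.

```python
# Naive per-tag anomaly scan over a SWaT/WaDI-style time series.
# Column names ("Timestamp", sensor tags), window size, and threshold are
# illustrative assumptions; this is not the detection logic of any evaluated system.
import pandas as pd

def rolling_zscore_flags(df: pd.DataFrame, window: int = 600, thresh: float = 6.0) -> pd.DataFrame:
    """Flag (tag, timestamp) pairs whose value deviates strongly from a rolling baseline."""
    flags = []
    for tag in df.select_dtypes("number").columns:
        series = df[tag]
        mean = series.rolling(window, min_periods=window).mean()
        std = series.rolling(window, min_periods=window).std()
        z = (series - mean) / std.mask(std == 0)  # avoid division by zero on flat tags
        hits = z.abs() > thresh                   # NaN comparisons evaluate to False
        for ts in df.loc[hits, "Timestamp"]:
            flags.append({"tag": tag, "timestamp": ts})
    return pd.DataFrame(flags, columns=["tag", "timestamp"])

readings = pd.read_csv("swat_july05.csv", parse_dates=["Timestamp"])
candidates = rolling_zscore_flags(readings)
print(candidates.groupby("tag").size().sort_values(ascending=False).head(10))
```

A scan like this surfaces far more candidates than true events, which is one reason the rubric below scores entity identification and root cause analysis in addition to raw detection.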

Methodology

Setup

  • We evaluated each system using a standardized prompt that provided basic domain information about the dataset and asked the system to surface anything of interest to the user. No additional guidance, labels, or hints about specific attacks were provided during the run.
  • For network datasets (CICIDS), we provided: dataset size (2.4M flows), feature descriptions (source/destination IPs, ports, protocols, flow statistics), and an indication that this was a set of network logs. No information about attack periods or types was disclosed.
  • For cyber-physical datasets (WaDI/SWaT), we provided: sensor/actuator descriptions and basic knowledge that this was logs of a physical system relating to water management. No information about attack periods or types was disclosed.
  • Foundation models (Claude 4.5 Sonnet, GPT-5, Gemini 2.5 Pro) could not reason over the data directly because the datasets were too large to upload as artefacts, so each was run through a wrapper agent. For these wrappers we used the code-editing agents released by their respective labs, giving each model maximum freedom to write its own analysis scripts and tools for the datasets.
  • We used Claude Code for Sonnet, Codex for GPT-5, and Gemini CLI for Gemini 2.5 Pro.
  • Each agent was instantiated with a clean environment containing only the datasource.
  • No system had access to attack labels, ground truth, or the evaluation rubric during analysis.
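
To make "standardized prompt, clean environment" concrete, below is a hedged sketch of how a single run might be staged: a fresh working directory containing only the datasource and a prompt of the kind described above. The directory layout, prompt wording, and the omitted agent invocation are illustrative placeholders rather than the exact harness used in this evaluation.

```python
# Illustrative staging sketch: a clean working directory holding only the
# dataset and a standardized prompt. Paths, prompt wording, and the omitted
# agent launch are hypothetical placeholders, not the exact evaluation harness.
import shutil
import tempfile
from pathlib import Path

PROMPTS = {
    "network": (
        "You are given network flow logs (~2.4M flows) with source/destination IPs, "
        "ports, protocols, and flow statistics. Analyze the data and surface "
        "anything of interest to the user."
    ),
    "cyber_physical": (
        "You are given logs from a physical water-management system, with one column "
        "per sensor or actuator. Analyze the data and surface anything of interest "
        "to the user."
    ),
}

def stage_run(dataset_path: str, kind: str) -> Path:
    """Copy only the datasource into a fresh directory; no labels or rubric are included."""
    workdir = Path(tempfile.mkdtemp(prefix="eval_run_"))
    shutil.copy(dataset_path, workdir / Path(dataset_path).name)
    (workdir / "PROMPT.txt").write_text(PROMPTS[kind])
    return workdir

workdir = stage_run("swat_july05.csv", "cyber_physical")
print(f"Agent working directory prepared at {workdir}")
# The code-editing agent (Claude Code, Codex, or Gemini CLI) would be launched
# here against `workdir`; the exact CLI invocation is intentionally omitted.
```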

Evaluation

  • Once each system completed its analysis, we collected every artefact it produced, including generated written reports and direct responses to the user.
  • We evaluated each report against our 3-point rubric per attack, checking for: (1) Statistical artifact detection, (2) Entity identification, and (3) Root cause analysis.
  • For statistical artifacts, we accepted any valid detection that identified a pattern signature within ±2 minutes of the ground-truth attack window.
  • For entity identification, we required pointing out exact matching entities involved in an incident. For network attacks, this meant correctly tracing through NAT translations. For cyber-physical attacks, this meant identifying specific sensors/actuators.
  • For root cause analysis, we accepted semantically equivalent descriptions of the cause and mechanism behind the event. For example, "credential stuffing," "brute force," and "dictionary attack" were all acceptable for FTP-Patator.
  • We calculated scores as: (Points earned / Maximum possible points) × 100%, where maximum points = (Number of events) × 3.
  • See the specific rubrics and criteria for each dataset here: https://github.com/conway-engineering/eval-rubrics
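
For concreteness, the sketch below spells out the scoring arithmetic: three rubric points per event and the ±2 minute tolerance on detected windows. The data structures are illustrative placeholders; the real per-dataset criteria are in the rubric repository linked above.

```python
# Scoring sketch: 3 rubric points per event (artifact, entity, root cause),
# with a ±2 minute tolerance on the detected window. Data structures are
# illustrative; see the linked rubric repository for the actual criteria.
from dataclasses import dataclass
from datetime import datetime, timedelta

TOLERANCE = timedelta(minutes=2)

def within_tolerance(detected_start: datetime, true_start: datetime) -> bool:
    """Timing check used for the statistical-artifact point."""
    return abs(detected_start - true_start) <= TOLERANCE

@dataclass
class EventScore:
    artifact: bool    # statistical signature found within tolerance
    entity: bool      # exact entities (IPs / sensors / actuators) identified
    root_cause: bool  # semantically equivalent cause and mechanism given

def total_score(events: list[EventScore]) -> float:
    """(Points earned / maximum possible points) x 100, with 3 points per event."""
    earned = sum(e.artifact + e.entity + e.root_cause for e in events)
    return 100.0 * earned / (3 * len(events))

# Example: full credit on one event, only the artifact point on another.
print(f"{total_score([EventScore(True, True, True), EventScore(True, False, False)]):.1f}%")  # 66.7%
```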

Results


Total Points by System

System   Points   Score (% of max points)
Conway   25       21.9%
Cursor   11       9.6%
Devin    11       9.6%
Claude   10       8.8%
Codex    9        7.9%
Gemini   0        0%

Analysis


We found that most of the difference between Conway and competitors lay in discovering unexpected data patterns.

We noted that the performance of Conway and its competitors in the CICIDS scenario was separated by a narrow margin (4 points of the 51 available, roughly 8%). This seemed to be because both Claude and GPT-5 have strong priors about the signs that indicate a cyberattack: they excelled at catching attacks with widely known, obvious signatures such as DDoS and port scanning.

However, in both the WaDI and SWaT scenarios, Conway led by double digits (11% and 15% better than the next-best model on each dataset, respectively). The other agents seemed to struggle to understand what kinds of patterns to look for and performed poorly in both scenarios.

This supports our view that Conway's separation of pattern detection from pattern interpretation is the right approach: Conway surfaced patterns that the other agents missed in both water-system scenarios and scored significantly higher as a result.

Future Directions


There are several areas for improvement. The two most salient in these scenarios were failures in pattern detection and in reasoning over causal relationships. We trace both shortcomings to the same root cause: the system needs more, and higher-quality, context before it can reach definitive conclusions.

Most of the missed points came from scenarios with subtle signatures that were never flagged. In particular, there was a class of patterns with unremarkable statistical significance but high contextual significance. Pushing our detection mechanisms to integrate domain context and learn to prioritize weak but contextually significant patterns is a key place to improve.

The other missed points, in scenarios that were correctly flagged, involved uncertainty about causal relationships within the data. Although Conway correctly flagged the statistical signature and the entities involved, it had neither direct evidence nor prior knowledge of the direction of causation, and so failed to earn the third rubric point (root cause analysis). Establishing that direction requires more knowledge of process and dataset topology, which in the future can be supplied up front or recovered through more rigorous statistical investigation at test time.
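
One concrete form such a test-time investigation could take (a hypothetical illustration, not a description of Conway's current mechanism) is a lagged cross-correlation check between flagged tags: if changes in one signal consistently precede changes in another, that ordering is weak evidence for the direction of causation. A minimal sketch on synthetic data:

```python
# Sketch of a test-time check on causal direction between two flagged tags:
# which shift of signal `a` best explains signal `b`? A positive best lag
# (a leads b) is weak evidence that a drives b. Data here is synthetic.
import numpy as np

def best_lag(a: np.ndarray, b: np.ndarray, max_lag: int = 120) -> tuple[int, float]:
    """Return the lag of `a` (in samples) that maximizes |corr(a shifted, b)|."""
    best = (0, 0.0)
    for lag in range(-max_lag, max_lag + 1):
        if lag >= 0:
            x, y = a[: len(a) - lag], b[lag:]   # a[t] paired with b[t + lag]
        else:
            x, y = a[-lag:], b[: len(b) + lag]  # b[t] paired with a[t - lag]
        if len(x) < 2 or np.std(x) == 0 or np.std(y) == 0:
            continue
        r = float(np.corrcoef(x, y)[0, 1])
        if abs(r) > abs(best[1]):
            best = (lag, r)
    return best

# Synthetic illustration: the flow reading responds ~30 samples after the valve command.
rng = np.random.default_rng(0)
valve_cmd = rng.normal(size=2000)
downstream_flow = np.roll(valve_cmd, 30) + 0.1 * rng.normal(size=2000)

lag, r = best_lag(valve_cmd, downstream_flow)
print(f"best lag = {lag} samples, corr = {r:.2f}")  # expect lag near 30: valve_cmd leads
```

A simple check like this cannot prove causation, but it can raise confidence in a proposed mechanism when paired with the process knowledge described above.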

Disclaimer


This benchmark is not a perfect measurement of our system's performance. Each of the test datasets contains a small set of salient patterns, each captured across multiple variables, and there is often more nuance to pattern discovery and analysis in real deployments. Factors like the quality and comprehensiveness of the collected data have a large impact on system performance.

To better understand our system's performance in such scenarios, we maintain an internally curated set of annotated datasets, and we plan to share further results with the general public.

Citations

[1] Conway Research Team. (2024). Pattern Discovery Evaluation Results. Internal Research Report.

[2] Sharafaldin, I., Habibi Lashkari, A., & Ghorbani, A. A. (2018). Toward Generating a New Intrusion Detection Dataset and Intrusion Traffic Characterization. In Proceedings of the 4th International Conference on Information Systems Security and Privacy (ICISSP), 108–116. SciTePress. https://doi.org/10.5220/0006639801080116

[3] Ahmed, C. M., Palleti, V. R., & Mathur, A. P. (2017). WADI: A Water Distribution Testbed for Research in the Design of Secure Cyber Physical Systems. In Proceedings of the 3rd International Workshop on Cyber-Physical Systems for Smart Water Networks (CySWATER '17), 25–28. ACM. https://doi.org/10.1145/3055366.3055375

[4] Goh, J., Adepu, S., Junejo, K. N., & Mathur, A. (2017). A Dataset to Support Research in the Design of Secure Water Treatment Systems. In Critical Information Infrastructures Security (CRITIS 2016), LNCS 10242, 88–99. Springer. https://doi.org/10.1007/978-3-319-71368-7_8