From Data to Deployment:

Digitizing Contracts with Vector Institute

Sharing our results from Vector's DarMoD program.

Earlier this year, we joined the Vector Institute's Data Readiness & Model Deployment program with a clear goal in mind: explore how AI could eliminate one of the most tedious parts of a real estate transaction — manual data entry.

For brokerages, this isn’t just a minor inconvenience. Every contract that flows through a real estate office needs to be reviewed, parsed, and keyed into internal systems. It’s repetitive, error-prone, and eats up valuable time that staff and agents could be spending elsewhere.

So we asked: what if this could be fully automated?

Tackling the Problem

Our project started with one of the most common and data-rich documents in the real estate world: the purchase and sale contract, sometimes known as the APS or CPS. These contracts are standardized to a degree, but still highly variable in practice — with handwritten edits, strikethroughs, multiple addenda, and non-standard formatting.

At first, we experimented with out-of-the-box tools, but found them too brittle for the level of nuance real-world contracts require. So we pivoted toward a more customized approach.

Building the Solution

Working with mentors from Vector and drawing on hundreds of anonymized, annotated contracts, we developed a multi-stage pipeline:

  1. Preprocessing: Removing unnecessary pages to improve efficiency and reduce noise.
  2. Strikethrough Detection and Masking: A custom machine learning model that detects struck-out values and covers them up to simplify reading the correct data.
    contract with strikethrough text
    Original contract with multiple strikethroughs
    contract with masked areas
    Masked output with strikethroughs removed
  3. Field Detection and Extraction: A second custom machine learning model that extracts key-value pairs based on a unique data schema for each province.
  4. Normalization: Removes junk and noise from extracted text, ensures standard formatting, and handles multi-value fields and relationships.
  5. Document Assembly: Returns a standardized output in an ordered format.
    data schema
    Data field list for a typical BC contract

This wasn’t a one-and-done process. We iterated through several versions, continuously refining accuracy and performance based on field-level metrics.

What We Achieved

By the end of the program, we had a fully functional data extraction system capable of pulling clean, structured information from contracts in seconds.

Here's what that will mean in practice:

  • 10-15 minutes of admin time saved per transaction
  • Reduction in keying errors and follow-up corrections
  • Structured data ready for payout, compliance, and reporting workflows

More importantly, this pipeline is modular — it can be extended to other forms like listings, representation agreements, and more.

What's Next

This project lays the groundwork for something much larger: true back-office automation. With reliable document understanding, we can start building AI-powered workflows that don't just extract data, but can also validate, flag issues, and move deals forward without human intervention.

In the months ahead, we’ll be integrating this pipeline directly into RealDesk and continuing to train it on more documents and edge cases. Our goal is simple: take the friction out of every back-office process, starting with the ones that waste the most time today.

Thanks

Huge thanks to the Vector Institute team for their guidance, support, and technical feedback throughout the program — and to our own team for the persistence and creativity that made this possible.

The real estate industry has waited long enough for meaningful innovation behind the scenes. We're excited to keep building it.

Ready to discover the future of real estate?

Let's chat.

Schedule a Call

Meet With Our Team

Events Calendar

Attend an Info Session