Preface: I want to define the minimum high-impact activities an agent can do within my workflow. Then, I want to evaluate the potential benefits of creating a permanent runtime, rather than a temporary "invoked" runtime.
What is an AI Agent?
My conception of an AI Agent is something that runs in a temporary "invoked" runtime, or a permanent runtime, built around a specific LLM. It involves inputs, an attempt to coerce the model towards a model-decided output, and an output parser that runs certain actions based on the model's output.
The inputs conventionally involve a mix of a chaotic user request and common instructions for how to respond. Something like:
"You will respond to user input. You will return output in JSON. Your options are ---. The user's input is: ---."
With a bit more rigidity - but the basic idea is encapsulated above.
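To make that concrete, here is a minimal sketch of the "invoked" flavour of this, assuming the official openai Python package. The action names (send_email, create_ticket, do_nothing) are purely hypothetical placeholders for whatever the output parser would actually trigger:

```python
import json
from openai import OpenAI  # assumes the official openai package is installed

client = OpenAI()

SYSTEM_PROMPT = (
    "You will respond to user input. "
    "You will return output in JSON with the keys 'action' and 'reason'. "
    "Your options are: 'send_email', 'create_ticket', 'do_nothing'."
)

def run_agent(user_input: str) -> dict:
    """One 'invoked' agent step: send the common instructions plus the user's
    chaotic request, then parse the model-decided output so an action can run."""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": user_input},
        ],
        response_format={"type": "json_object"},  # coerce the output to JSON
    )
    decision = json.loads(response.choices[0].message.content)

    # The "output parser" side: map the model's choice onto real actions.
    if decision["action"] == "send_email":
        pass  # hypothetical: hand off to an email-sending function
    elif decision["action"] == "create_ticket":
        pass  # hypothetical: call a ticketing API
    return decision
```

A permanent runtime would wrap something like this in a long-lived loop or service, rather than invoking it once per request - which is the trade-off I want to evaluate.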
Is an AI Agent even necessary?
This is an important question to review before jumping into the AI Agent world. To help, I will start with what I commonly do at work, as well as some unusual exceptions.
- Programming: I am often taking feature ideas or bug fixes, and converting them into the code which is required to resolve the given task.
- Emails: I read and respond to a plethora of emails - some business-important, some functional, some documentative.
- Testing: I am often debugging data pipelines by testing their individual components.
- Calling: I am often calling individuals to help install software or diagnose problems in systems. Not AI-Agent capable, within the current scope of this initial review.
- Meeting: I meet with other individuals. The only axis for adjusting this is to change requirements and redirect meetings to emails, calls, or texts, as Naval Ravikant suggests. Not AI-Agent capable.
The above does not capture, in enough detail, what a given task actually involves. This has often been my problem with using AI - my work is a series of semi-random occurrences.
I also don't think I will be able to "plan" the perfect agent. It'll have to evolve - naturally - by using it in the field. I will make a conscious effort to test the ideas, rather than just discuss them.
What LLMs might be natively great for
LLMs are text parsers and text-continuers. OpenAI has done a good job of separating the model from the asker, so the model's job is to step forward one more response, then wait for user feedback.
Due to relying on training data, models think inside their training data - they think "in their box".
Ask an LLM to:
- Tell you what you can make with existing ingredients: It will do well and provide you existing recipes. It will not risk creating a new recipe.
- Ask an LLM to refactor or assess some code: It will do it within existing principles, unless directed to do something else it can understand.
- Ask an LLM to make a decision: It will consider in-box options.
- Ask an LLM to classify some input based on a series of categories: It will do it as it believes.
- Ask an LLM to simulate being something: It will do as it believes, mimicking what it believes that thing is.
- Ask an LLM to adjust prose, mimic a style: It will do as it believes.
These are high-level claims, and I hope you understand where I am coming from: Models think inside their box. This is OKAY! That is what they are made to do!
(PLEASE NOTE: I am writing a lot of fluff here. The above is to be revised after I do some more work.)
23 November 2024
I've travelled north from my place to visit family, on the way enjoying the recent 5-hour Lex Fridman interview with Dario Amodei, Christopher Olah and Amanda Askell - three deeply interesting guests. Dario is Anthropic's CEO, Christopher works on understanding the internal circuits of model brains, and Amanda is a Philosophy-Engineer focussing on alignment, system prompts, and chatting with Anthropic's Claude.
- One focus of Anthropic is safety. Dario believes in a "race to the top" approach, positive-sum.
- Move fast, with safety backed by testing frameworks. Using evals, as well as their RSP/ASL levels, they can test progressively.
- Amanda's background in Philosophy has served her strongly in her work on prompting and neutralising the rough edges of Claude.
- Chris' work on Mechanistic Interpretability is fascinating, as a relatively small but emerging field.
- Anthropic's Constitutional AI and synthetic data generation are poised as potential ways to increase training data.
Their work is excellent, and the conversation was very enlightening as to how these LLM/AI companies operate and think.
ASL
ASL (AI Safety Levels) is a framework to quantify broad-scale model risk. It aims to distinguish risk levels, to justify more resource investment into ensuring safety is met. Since intelligence seems to increase when models run longer, or work in workflows or agents, I'd hope that companies are testing these as well as one-shot responses - which I'm sure they are!
- ASL1 = relatively narrow-scoped AI, like DeepMind's game-playing systems.
- ASL2 = wider-scoped more general AI, capability of providing malicious instructions, misleading, limited autonomy.
- ASL3 = ability to act, reflection, self-optimise (self-train).
- ASL4 = self-adapt, complex workflows, potential emergent behaviours.
- ASL5 = ...
(Simplified).
AI Application
My aim is to increase leverage by increasing code output, preserving code quality, and decreasing mental effort. I personally have to think a lot when working with code, as I expend a lot of energy just considering formatting, patterns, structures, and so forth. If I can guide models to do most of the 0-to-1 work, steered by my input, then prompt for fixes and do the remaining fixes manually - the hypothesis is that this will increase leverage.
I've done this using 4o + Canvas to help design a custom alert HTML email for Pharma Portal, and just today experimented with a Custom GPT to help build a new Pharma Portal onboarding feature.
Amanda's Prompt Engineering
Something Amanda Askell (Askell) said during the interview struck me in a new way. She refers to spending a large amount of time engineering prompts if the company aims to push a lot of usage through a specific model. For my use-case, I keep wanting to reach for the fun "framework building" approach, which is a heavy resource investment, and I keep refraining since it seems like eager over-optimisation on something that might not be the right approach.
To extend this thinking - if working more generally, and not building a single high-throughput system - one might reason that it is better to spend a large amount of time on smaller components that can be mixed and matched together. This mirrors the rest of the software we use, where we compose components, libraries, and frameworks, rather than building rigid infrastructure that is coupled to other rigid infrastructure.
So - I think that is where I will spend a bit of time - working on a microframework for composing chains, which uses functions over very verbose classes.
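To give a feel for what I mean by "functions over very verbose classes", here is a rough sketch - the node names and the fake model call are placeholders I've made up, not the actual microai API:

```python
from functools import reduce
from typing import Callable

# A "node" is just a function from one value to the next.
Node = Callable[[str], str]

def chain(*nodes: Node) -> Node:
    """Compose small nodes left-to-right into a single callable."""
    return lambda value: reduce(lambda acc, node: node(acc), nodes, value)

# Hypothetical nodes - each one could wrap a prompt template, a model call, or a parser.
def add_context(text: str) -> str:
    return f"Context: internal pharmacy data.\nTask: {text}"

def fake_model_call(prompt: str) -> str:
    return f"[model response to: {prompt!r}]"  # stand-in for a real LLM call

classify = chain(add_context, fake_model_call)
print(classify("Categorise this support email."))
```

The appeal is that every piece stays a plain function, so mixing and matching components is just composition.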
24th of November, 2024
I've started working on the microai repo to spend some more time feeling out the APIs by writing a simple wrapper for models.
I'm spending a little too much time on representing message types: I want the benefit of flexible input types, but I am also feeling the complexity increase. For example, I'd like to return the raw output of an OpenAI Message object from their openai module, but the downside is that it means supporting more input types when using the model.
I'm considering writing a simple abstraction for nodes, and I will also look at how it feels simply using a single standard for messages. I'll report back once I've made some more progress!
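As a rough illustration of the "single standard for messages" idea - the names here are mine, not the actual microai API - the wrapper would hold one internal message type and convert to and from provider formats at the edges:

```python
from dataclasses import dataclass
from typing import Literal

@dataclass
class Message:
    """One internal message standard; provider formats are adapted at the edges."""
    role: Literal["system", "user", "assistant"]
    content: str

def to_openai(messages: list[Message]) -> list[dict]:
    """Adapt the internal standard to the dicts the openai module expects."""
    return [{"role": m.role, "content": m.content} for m in messages]

def from_openai(raw: dict) -> Message:
    """Wrap a raw OpenAI-style message back into the internal standard."""
    return Message(role=raw["role"], content=raw["content"])

history = [
    Message("system", "You are a helpful assistant."),
    Message("user", "Summarise this data pipeline error."),
]
payload = to_openai(history)  # ready to pass to a model wrapper
```

The cost is losing the raw provider objects, but the upside is that every node only ever has to understand one message shape.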
8th of December, 2024
I've experimented with the incredible Cursor editor, and I've also been taking a deeper look into the existing tooling and experimentation around AI so I can make sense of the direction things are heading.
Cursor has already supercharged my development, and I'm now also looking at understanding Semantic Layers and Knowledge Graphs, and reading articles and watching lectures on the latest architectures for building these systems. My gut tells me they are powerful, but very harmful for organisations without an understanding of the symmetry and risk that come from how these systems work under the hood.
An Enterprise Knowledge article on semantic layers explores use-cases for "semantic layers", which sit in the area of knowledge graphs and unstructured-to-structured graphs, and how entity resolution (ER) fits into this. It grounds the use-case in some fitting real-world examples:
- Improving data quality and resolving inconsistencies
- Removing duplicate and redundant records
- Interoperability from a 360-degree view
The article argues that "Entity Resolution" is a core concept which helps solve a lot of problems within the context of a semantic layer - the ability to relate similar objects, as well as identify the same entity within different datasets.
It goes on to illustrate that the goal is to reach an "Entity Resolved Knowledge Graph" (ERKG): a data graph that has been collapsed into a knowledge graph where duplicate entities are removed.
To bring this back to simpler language: knowledge graphs are smarter data graphs, where entity resolution is used to attribute graph nodes, and to collapse or split similar or different data points to find new relationships.
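A toy sketch of that collapsing step, where the matching rule is a crude name normalisation standing in for real ER techniques (fuzzy matching, shared identifiers, embeddings), and the record names are made up:

```python
from collections import defaultdict

# Toy records from two datasets that mention the same real-world entities.
records = [
    {"id": "crm-1", "name": "Acme Pharmacy Pty Ltd", "source": "crm"},
    {"id": "erp-7", "name": "ACME PHARMACY",         "source": "erp"},
    {"id": "crm-2", "name": "Beacon Health",         "source": "crm"},
]

def entity_key(record: dict) -> str:
    """Crude resolution rule: normalise the name. Real ER would use
    fuzzy matching, addresses, identifiers, or embeddings."""
    return " ".join(record["name"].lower().replace("pty ltd", "").split())

# Collapse duplicates: one knowledge-graph node per resolved entity,
# keeping links back to every source record.
nodes: dict[str, list[str]] = defaultdict(list)
for record in records:
    nodes[entity_key(record)].append(record["id"])

print(dict(nodes))
# {'acme pharmacy': ['crm-1', 'erp-7'], 'beacon health': ['crm-2']}
```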
16th of December, 2024
In the 8 days since I mentioned my usage of Cursor, my dive into A.I. has made more progress. I've been an A.I.-doomer in the past, though I now feel quite differently about it. Having more experience with the latest tooling and the latest approaches shows me the weaknesses, though I now see them as things that will be improved over time.
"The Next Big Model"
One example of how my thinking has changed is that many people are focussed on the next big model from OpenAI, Anthropic, and so forth. Although this excites me, I am also excited for the constant "well-rounding" and enhancements to logical thinking that can be made now, and in the near future, to remove the rough edges for practitioners.
Put it this way: if there is a 3-5% "rough edge" in today's models (4o), increasing the size of the model will only increase the surface area of these rough edges. As we don't know how these models work internally, this can create a cascade of logical or technical errors in the future. With this in mind, I see a great path forward for refinement and distillation, then leading into a natural scaling.
Classifying
I've been using a lot of the latest A.I. tooling for classification in our company's work recently. I've practised with countless prompting methods, with frustrating and exciting results. Both the determinism and the non-determinism are fascinating to me, in terms of how we're going to wrangle these into the future.
I'm playing around with vectorising at the moment as another angle for classifying. This is going well - using LLMs to expand the content of the vector item, and then doing lookups on the expanded item descriptions to try to marry them up.
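Roughly, the approach looks like the sketch below, assuming the openai package, with the category and item names made up for illustration: the LLM expands each terse item, embeddings are computed for items and categories, and cosine similarity picks the nearest category.

```python
import numpy as np
from openai import OpenAI

client = OpenAI()

def expand(item: str) -> str:
    """Use the LLM to enrich a terse item description before embedding."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user",
                   "content": f"Expand this product description in one sentence: {item}"}],
    )
    return response.choices[0].message.content

def embed(texts: list[str]) -> np.ndarray:
    result = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return np.array([d.embedding for d in result.data])

# Hypothetical categories and items for illustration.
categories = ["prescription medicine", "vitamins and supplements", "medical devices"]
items = ["Ibuprofen 200mg 24pk", "Vit C chewables"]

category_vecs = embed(categories)
item_vecs = embed([expand(item) for item in items])

# Cosine similarity: each item is assigned its nearest category.
sims = (item_vecs @ category_vecs.T) / (
    np.linalg.norm(item_vecs, axis=1, keepdims=True)
    * np.linalg.norm(category_vecs, axis=1)
)
for item, row in zip(items, sims):
    print(item, "->", categories[int(np.argmax(row))])
```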
Functional or OOP for LLM tool design?
I've also made some updates to microai, creating a really basic representation of a Chain using OOP and builder patterns. I then replaced some old chain code and noticed it was remarkably better. I'd been on a more pro-function bend recently with Python, though forcing myself to try LLM interaction using functions, then moving to an OOP + Builder pattern, has shown me that it is absolutely the better pattern/paradigm to use when building tools that interact with LLMs.
I will add that I really like the dynamic representation of a chain within a builder pattern. I will experiment with other functions, but this is working well for now while I experiment organically.
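To show the kind of dynamic representation I mean, here is a sketch of the builder feel - the method names are hypothetical rather than the real microai Chain, and the model call is stubbed out:

```python
class Chain:
    """Builder-style chain: each step returns self, so chains read declaratively."""

    def __init__(self, model: str = "gpt-4o"):
        self.model = model
        self.steps = []

    def system(self, text: str) -> "Chain":
        self.steps.append(("system", text))
        return self

    def user(self, text: str) -> "Chain":
        self.steps.append(("user", text))
        return self

    def parse_json(self) -> "Chain":
        self.steps.append(("parser", "json"))
        return self

    def run(self, **inputs) -> str:
        # Stand-in: render the accumulated steps instead of calling a real model.
        rendered = [(role, text.format(**inputs))
                    for role, text in self.steps if role in ("system", "user")]
        return f"[would call {self.model} with {rendered}]"

result = (
    Chain()
    .system("Classify support emails into categories.")
    .user("Email body: {body}")
    .parse_json()
    .run(body="The portal login is broken.")
)
print(result)
```

The chain definition reads top-to-bottom as a description of the interaction, which is the quality I found hard to reproduce with plain functions.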
I've expanded some of my thinking on this topic in Large Language Model Tool Design: Functional or OOP? on this website.