Why read this article?
This article is not one about how to structure your prompts to enable your AI agent to perform magic. There is already a sea of articles that go into detail about what structure to use and when, so there’s no need for another.
Instead, this article is one of a series about how to keep yourself, the coder, relevant in the modern AI coding ecosystem.
It’s about learning the techniques that enable you to excel in utilising coding agents better than those who blindly hit tab or copy-paste.
We will cover the concepts from existing software engineering practice that you should be aware of, and explain why these concepts are particularly relevant right now.
- By reading this series, you should have a good idea of what common pitfalls to look for in auto-generated code, and know how to guide a coding assistant to create production grade code that is maintainable and extensible.
- This article is most relevant for budding programmers, graduates, and professionals from other technical industries that want to level up their coding expertise.
What we will cover not only makes you better at using coding assistants but also a better coder in general.
The Core Concepts
The high level concepts we’ll cover are the following:
- Code Smells
- Abstraction
- Design Patterns
In essence, there’s nothing new about them. To seasoned developers, they are second nature, drilled into their brains through years of PR reviews and debugging. You eventually reach a point where you instinctively react to code that “feels” like future pain.
And now, they are perhaps more relevant than ever, since coding assistants have become an essential part of every developer’s experience, from juniors to seniors.
Why?
Because the manual labor of writing code has been offloaded. The primary responsibility for any developer has now shifted from writing code to reviewing it. Everyone has effectively become a senior developer guiding a junior (the coding assistant).
So, it’s become essential for even junior software practitioners to be able to ‘review’ code. But the ones who will thrive in today’s industry are the ones with the foresight of a senior developer.
This is why we will be covering the above concepts: so that, at the very least, you can tell your coding assistant to take them into account, even if you don’t know exactly what you’re looking for yourself.
So, introductions are now done. Let’s get straight into our first topic: Code smells.
Code Smells
What is a code smell?
I find it a very aptly named term – it’s the equivalent of sour smelling milk indicating to you that it’s a bad idea to drink it.
For decades, developers have learnt through trial and error what kind of code works long-term. “Smelly” code is brittle, prone to hidden bugs, and difficult for a human or an AI agent to understand.
Thus it is generally very useful for developers to know about code smells and how to detect them.
Useful links for reading more about code smells:
https://luzkan.github.io/smells
https://refactoring.guru/refactoring/smells
Now, having used coding agents to build everything from professional ML pipelines for my 9-5 job to entire mobile apps in languages I’d never touched before for my side-projects, I’ve identified two typical “smells” that emerge when you become over-reliant on your coding assistant:
- Divergent Change
- Speculative Generality
Let’s go through what they are, the risks involved, and an example of how to fix it.

Divergent Change
Divergent change is when a single module or class is doing too many things at once. The purpose of the code has ‘diverged’ into many different directions and so rather than being focused on being good at one task (Single Responsibility Principle), it is trying to do everything.
This results in a painful situation where this code is always breaking and thus requires fixing for various independent reasons.
When does it happen with AI?
When you are not engaged with the codebase and blindly accept the agent’s output, you are doubly susceptible to this.
Yes, you may have done all the correct things and written a nicely structured prompt that adheres to the latest in prompt engineering.
But in general, if you ask it to “add functionality to handle X,” the agent will usually do exactly as it is told and cram code into your existing class, especially when the existing codebase is already very complicated.
It is ultimately up to you to take into account the role, responsibility and intended usage of the code to come up with a holistic approach. Otherwise, you’re very likely to end up with smelly code.
Example — ML Engineering
Below, we have a ModelPipeline class from which you can get whiffs of future extensibility issues.
class ModelPipeline:
    def __init__(self, data_path):
        self.data_path = data_path

    def load_from_s3(self):
        print(f"Connecting to S3 to get {self.data_path}")
        return "raw_data"

    def clean_txn_data(self, data):
        print("Cleaning specific transaction JSON format")
        return "cleaned_data"

    def train_xgboost(self, data):
        print("Running XGBoost trainer")
        return "model"
A quick warning:
We can’t talk in absolutes and say this code is bad just for the sake of it.
It always depends on the wider context of how the code is used. For a simple codebase that isn’t expected to grow in scope, the code above is perfectly fine.
Also note:
It’s a contrived and simple example to illustrate the concept.
Don’t bother giving this to an agent to prove it can figure out this is smelly without being told so. The point is for you to recognise the smell before the agent makes it worse.
So, what are things that should be going through your head when you look at this code?
- Data retrieval: What happens when we start having more than one data source, like BigQuery tables, local databases, or Azure blobs? How likely is this to happen?
- Data Engineering: If the upstream data changes or downstream modelling changes, this will also need to change.
- Modelling: If we use a different model, like LightGBM or some neural net, the modelling code needs to change.
You should notice that by coupling platform, data engineering, and ML engineering concerns in a single place, we’ve tripled the reasons for this code to be modified – i.e. code that is beginning to smell like ‘divergent change’.
Why is this a possible problem?
- Operational risk: Every edit runs the risk of introducing a bug, whether made by a human or an AI. By having this class wear three different hats, you’ve tripled the risk of it breaking, since there are three times as many reasons for the code to change.
- AI Agent Context Pollution: The agent sees the cleaning and training code as part of the same problem. For example, it is more likely to change the training and data-loading logic to accommodate a data engineering change, even when that is unnecessary. Ultimately, this worsens the ‘divergent change’ code smell.
- Risk is magnified by AI: An agent can rewrite hundreds of lines of code in a second. If those lines represent three different disciplines, the agent has just tripled the chance of introducing a bug that your unit tests might not catch.
How to fix it?
The risks outlined above should give you some ideas about how to refactor this code.
One possible approach is as below:
class S3DataLoader:
    """Handles only Infrastructure concerns."""
    def __init__(self, data_path):
        self.data_path = data_path

    def load(self):
        print(f"Connecting to S3 to get {self.data_path}")
        return "raw_data"


class TransactionsCleaner:
    """Handles only Data Domain/Schema concerns."""
    def clean(self, data):
        print("Cleaning specific transaction JSON format")
        return "cleaned_data"


class XGBoostTrainer:
    """Handles only ML/Research concerns."""
    def train(self, data):
        print("Running XGBoost trainer")
        return "model"


class ModelPipeline:
    """The Orchestrator: it knows 'what' to do, but not 'how' to do it."""
    def __init__(self, loader, cleaner, trainer):
        self.loader = loader
        self.cleaner = cleaner
        self.trainer = trainer

    def run(self):
        data = self.loader.load()
        cleaned = self.cleaner.clean(data)
        return self.trainer.train(cleaned)
Formerly, the model pipeline’s responsibility was to handle the entire DS stack.
Now, its responsibility is to orchestrate the different modelling stages, whilst the complexities of each stage are cleanly separated into their own respective classes.
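To see the decoupling in action, wiring the pieces together might look like the sketch below, using the classes defined above (the S3 path is just a made-up placeholder):

# Compose the pipeline from its parts; each dependency can be swapped independently.
loader = S3DataLoader(data_path="s3://my-bucket/transactions.json")  # placeholder path
cleaner = TransactionsCleaner()
trainer = XGBoostTrainer()

pipeline = ModelPipeline(loader=loader, cleaner=cleaner, trainer=trainer)
model = pipeline.run()  # returns the placeholder "model" string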
What does this achieve?
1. Minimised Operational Risk: Now, concerns are decoupled and responsibilities are crystal clear. You can refactor your data-loading logic with confidence that the ML training code remains untouched. As long as the inputs and outputs (the “contracts”) stay the same, the risk of impacting anything downstream is lowered.
2. Testable Code: It is significantly easier to write unit tests since the scope of testing is smaller and well defined.
3. Lego-brick Flexibility: The architecture is now open for extension. Need to migrate from S3 to Azure? Simply drop in an AzureBlobLoader. Want to experiment with LightGBM? Swap the trainer.
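To make that last point concrete, here is a minimal sketch: a hypothetical AzureBlobLoader only needs to expose the same load() method to slot straight into the ModelPipeline above, with everything else left untouched. The class name and container details are made up for illustration.

class AzureBlobLoader:
    """Hypothetical drop-in replacement: exposes the same load() contract as S3DataLoader."""
    def __init__(self, container, blob_name):
        self.container = container
        self.blob_name = blob_name

    def load(self):
        print(f"Connecting to Azure Blob Storage to get {self.container}/{self.blob_name}")
        return "raw_data"


# The orchestrator and the other components stay untouched; only the loader changes.
pipeline = ModelPipeline(
    loader=AzureBlobLoader("ml-data", "transactions.json"),
    cleaner=TransactionsCleaner(),
    trainer=XGBoostTrainer(),
)
pipeline.run()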
You ultimately end up with code that is more reliable, readable, and maintainable for both you and the AI agent. If you don’t intervene, this class is likely to become bigger, broader, and flakier, and to end up an operational nightmare.
Speculative Generality

Whilst ‘Divergent Change’ occurs most often in an already large and complicated codebase, ‘Speculative Generality’ tends to appear when you are starting a new project.
This code smell is when the developer tries to future-proof a project by guessing how things will pan out, resulting in unnecessary functionality that only increases complexity.
We’ve all been there:
“I’ll make this model training pipeline support all kinds of models, cross validation and hyperparameter tuning methods, and make sure there’s human-in-the-loop feedback for model selection so that we can use this for all of our training in the future!”
only to find that…
- it’s a monster of a job,
- the code turns out flaky,
- you spend far too much time on it,
- and meanwhile you still haven’t built the simple LightGBM classification model you needed in the first place.
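In code, that kind of over-engineered pipeline often looks something like the sketch below. The class and every parameter here are hypothetical, purely to illustrate the ‘just in case’ surface area that creeps in:

class UniversalTrainingPipeline:
    """Speculative generality: knobs and hooks for futures that may never arrive."""
    def __init__(self, model_type="xgboost", cv_strategy=None,
                 tuner=None, feedback_callback=None, export_formats=None):
        self.model_type = model_type
        self.cv_strategy = cv_strategy              # unused: we only ever do a single split
        self.tuner = tuner                          # unused: no tuning is planned yet
        self.feedback_callback = feedback_callback  # unused: no human-in-the-loop exists yet
        self.export_formats = export_formats or ["pickle", "onnx", "pmml"]  # we only need one

    def train(self, data):
        # Every extra branch is code to maintain before the first model even ships.
        if self.model_type == "xgboost":
            print("Running XGBoost trainer")
        elif self.model_type == "lightgbm":
            print("Running LightGBM trainer")
        else:
            raise NotImplementedError(self.model_type)
        return "model"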
When AI Agents are susceptible to this smell
I’ve found that the latest, high performing coding agents are most susceptible to this smell. Couple a powerful agent with a vague prompt, and you quickly end up with too many modules and hundreds of lines of new code.
Perhaps every line is pure gold and it’s exactly what you need. When I experienced something like this recently, the code certainly seemed to make sense to me at first.
But I ended up rejecting all of it. Why?
Because the agent was making design choices for a future I hadn’t even mapped out yet. It felt like I was losing control of my own codebase, and that it would become a real pain to undo in the future if the need arises.
The Key Principle: Grow your codebase organically
The mantra to remember when reviewing AI output is “YAGNI” (You ain’t gonna need it). It’s a principle in software development that suggests you should only implement the code you need, not the code you foresee.
Start with the simplest thing that works. Then, iterate on it.
This is a more natural, organic way of growing your codebase that gets things done, whilst also being lean, simple, and less susceptible to bugs.
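Applied to the training-framework scenario above, the YAGNI-friendly starting point is just the code that today’s task needs (again a contrived sketch in the same placeholder style as the earlier examples):

def train_lightgbm_classifier(data):
    """The simplest thing that works for today's requirement: one model, one code path."""
    print("Running LightGBM classification trainer")
    return "model"

# Ship this, learn from it, and only add cross-validation, tuning,
# or alternative models when a concrete requirement appears.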
Revisiting our examples
We previously looked at refactoring Example 1 (The “Do-It-All” class) into Example 2 (The Orchestrator) to demonstrate how the original ModelPipeline code was smelly.
It needed to be refactored because it was subject to too many changes for too many independent reasons, and in its current state the code was too brittle to maintain effectively.
Example 1
class ModelPipeline:
    def __init__(self, data_path):
        self.data_path = data_path

    def load_from_s3(self):
        print(f"Connecting to S3 to get {self.data_path}")
        return "raw_data"

    def clean_txn_data(self, data):
        print("Cleaning specific transaction JSON format")
        return "cleaned_data"

    def train_xgboost(self, data):
        print("Running XGBoost trainer")
        return "model"
Example 2
class S3DataLoader:
    """Handles only Infrastructure concerns."""
    def __init__(self, data_path):
        self.data_path = data_path

    def load(self):
        print(f"Connecting to S3 to get {self.data_path}")
        return "raw_data"


class TransactionsCleaner:
    """Handles only Data Domain/Schema concerns."""
    def clean(self, data):
        print("Cleaning specific transaction JSON format")
        return "cleaned_data"


class XGBoostTrainer:
    """Handles only ML/Research concerns."""
    def train(self, data):
        print("Running XGBoost trainer")
        return "model"


class ModelPipeline:
    """The Orchestrator: it knows 'what' to do, but not 'how' to do it."""
    def __init__(self, loader, cleaner, trainer):
        self.loader = loader
        self.cleaner = cleaner
        self.trainer = trainer

    def run(self):
        data = self.loader.load()
        cleaned = self.cleaner.clean(data)
        return self.trainer.train(cleaned)
Previously, we implicitly assumed that this was production-grade code, subject to the various maintenance changes and feature additions that are frequently made to such code. In that context, the ‘Divergent Change’ code smell was relevant.
But what if this was code for a new product MVP or R&D? Would the same ‘Divergent Change’ code-smell apply in this context?

In such a scenario, opting for example 2 may actually be the smellier choice.
If the scope of the project is to consider one data source, or one model, building three separate classes and an orchestrator may count as ‘pre-solving’ problems you don’t yet have.
Thus, in MVP/R&D situations where detailed deployment considerations are unknown and the input data and output model requirements are specific and fixed, Example 1 could be more appropriate.
The Overarching Lesson
What these two code smells reveal is that software engineering is rarely about “correct” code. It is about context.
A coding agent can write perfect Python in both function and syntax, but it doesn’t know your entire business context. It doesn’t know if the script it’s writing is a throwaway experiment or the backbone of a multi-million dollar production pipeline revamp.
Efficiency tradeoffs
You could argue that we can simply feed the AI every little detail of business context, from the meetings you’ve had to the tea-break chats with a colleague. But in practice, that isn’t scalable.
If you have to spend half an hour writing a “context memo” just to get a clean 50-line function, have you really gained efficiency? Or have you just transformed the manual labor of writing code into that of writing prompts?
What makes you stand out from the rest
In the age of AI, your value as a data scientist has fundamentally changed. The manual labour of writing code has now been removed: agents will handle the boilerplate, the formatting, and the unit testing.
So, to stand out from the other data scientists who are blindly copy-pasting code, you need the structural intuition to guide a coding agent in a direction that fits your unique situation. The result is better reliability, performance, and outcomes, all of which reflect well on you.
But to achieve this, you need to build the intuition that normally comes with years of experience, by knowing the code smells we’ve discussed and the other two concepts (design patterns and abstraction) that we will delve into in subsequent articles.
And ultimately, being able to do this effectively gives you more headspace to focus on problem solving and architecting a solution to a problem – i.e. the real ‘fun’ of data science.
Related Articles
If you liked this article, see my Software Engineering Concepts for Data Scientists series, where we expand on the concepts most relevant for data scientists.