
I Use AI Every Day. What I Found Will Shock You.

I use AI tools every single day. Not for fun. Not to impress anyone. Because I’m a 66-year-old self-taught developer who started coding from scratch in May 2025, and without them I’d still be staring at a blank screen wondering what a variable is.


I use them for coding, debugging, copywriting, image generation prompts, research — the works. Nine different models. Daily. Some of them I love. Some of them make me want to eat my keyboard.

So I built a little rating tool, scored them across 14 use cases on a 1 to 19 scale, and wrote it all down. No PR fluff. No affiliate links. No one paid me a cent. Just a grumpy old man with opinions and data to back them up.
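The tool itself is nothing clever. Here's a minimal sketch of the idea in Python — the models and numbers below are placeholders for illustration, not my actual dataset:

```python
# Minimal sketch of the rating tool's core idea: store each model's
# per-use-case scores (1-19 scale) and average them.
# The scores here are illustrative placeholders, not the real data.

ratings = {
    "Claude":     {"debugging": 19, "copywriting": 17, "research": 19},
    "Grok":       {"debugging": 14, "copywriting": 13, "research": 12},
    "Perplexity": {"debugging": 10, "copywriting": 14, "research": 18},
}

def average_score(scores):
    """Mean score across all rated use cases."""
    return sum(scores.values()) / len(scores)

# Sort models by overall average, best first, and print a leaderboard.
for model, scores in sorted(ratings.items(),
                            key=lambda item: average_score(item[1]),
                            reverse=True):
    print(f"{model}: {average_score(scores):.1f}")
```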

Here’s what I found.

 

The Numbers First

Overall averages out of 19:

Model: Average

Claude: 16.8
Grok: 14.6
Perplexity: 14.2
z.ai: 14.0
Qwen: 12.9
DeepSeek: 12.6
ChatGPT / OpenAI: 10.6
Gemini: 9.9
Copilot: 9.1

Yes, Claude wins.

No, I’m not embarrassed to say it.

Data is data.

If my daily driver came in at 9.1 I’d say that too.

 

The Surprises

Debugging is where things get ugly.

Coding is one thing — most of the capable models can generate working code. Debugging is where you find out who can actually think. My ratings: Claude and z.ai both sitting at 18–19. ChatGPT? Nowhere close. Gemini and Copilot? 6. That is not a gap. That is a chasm.

If you’re using ChatGPT to debug code and wondering why you’re losing your mind — now you know.

z.ai came out of nowhere.

Most people haven’t heard of it. I hadn’t been using it long when I rated it. But it scored 18/19 on debugging. Eighteen. It’s not a fluke — it’s genuinely good. Keep an eye on it.

DeepSeek is a hidden gem — and wildly inconsistent.

Not Claude. Not ChatGPT. DeepSeek won image generation prompts at 18/19. It also scored 19 on research. But ask it to debug your code and you get a 7. Ask it to help with scheduling and you’re getting a 5. Some responses are genuinely useless. But when it locks in, it’s operating at a level that stops you mid-scroll. That inconsistency isn’t a dealbreaker — it’s just something you need to know going in. The good ones are gold.

Grok has a Twitter problem.

Grok is a solid tool — 14.6 average overall, wins data analysis outright at 19/19. But in research, it keeps referencing X (formerly Twitter, forever annoying) as if it’s a primary source. It’s not. It’s where people go to argue about sport. That bias drags the research score down considerably. Fix that and Grok would be genuinely frightening.

OpenAI and Gemini hallucinate. Casually.

Unless you specifically tell them to go and actually research something, they’ll make things up and serve them to you with complete confidence. They will research if instructed — but then they tend to pull from whatever sources they think you want to hear. Qwen does this too, just slightly less aggressively.

Copilot is quietly the worst — and I only keep it for comedy relief.

9.1 average. It’s Microsoft’s product, it lives inside Microsoft’s tools, and you’d think that integration alone would give it an edge. It doesn’t. Across coding, debugging, writing, data — it underperforms consistently. I haven’t seriously used it since The Practice went live over at Beachy Studio, but Copilot is still here. Occasionally I fire it up for a laugh. That’s about all it’s good for right now.

Perplexity surprised me on the upside.

I hadn’t used it for coding due to free tier limitations, which penalised its code score. But in research, summarising documents, explaining concepts, and comparing options — it’s genuinely excellent. 17–18 across those categories. More people should be using it for those tasks.

“My voice” is a real problem with most models.

For blog writing, I gave Claude 17/19. But even Claude doesn’t fully nail my voice without serious prompting. Most models produce technically correct copy that sounds like it was written by a well-meaning robot. If your writing has character — and mine does — you’ll be rewriting. A lot. I do.

 

Who Wins What

Coding (writing): Claude & Grok — tied at 19
Coding (debugging): Claude — 19 (z.ai close behind at 18)
Copywriting: Claude — 17
Image generation prompts: DeepSeek — 18
Research: Claude & DeepSeek — tied at 19
Summarising documents: Claude & Perplexity — tied at 19
Brainstorming: Claude — 19 (Grok close at 18)
Data analysis: Grok — 19
Explaining concepts: Perplexity — 18
Comparing options: Qwen — 18

No single model dominates everything.

Which means if you’re using just one, you’re leaving performance on the table.
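If you want that in code form, here’s a rough sketch of how you’d pull the top model per use case out of a score table like mine — same placeholder numbers as before, not the real dataset:

```python
# Sketch: find the best-scoring model for each use case, which is
# what shows that no single model tops every category.
# Scores are illustrative placeholders, not the real ratings.

ratings = {
    "Claude":   {"debugging": 19, "image prompts": 14, "data analysis": 16},
    "DeepSeek": {"debugging": 7,  "image prompts": 18, "data analysis": 12},
    "Grok":     {"debugging": 14, "image prompts": 12, "data analysis": 19},
}

# Collect every use case that appears in any model's scores.
use_cases = {case for scores in ratings.values() for case in scores}

for case in sorted(use_cases):
    # Pick the model with the highest score for this use case.
    winner = max(ratings, key=lambda model: ratings[model].get(case, 0))
    print(f"{case}: {winner} ({ratings[winner][case]}/19)")
```

Run that on real numbers and you get a different winner depending on the job — which is the whole point.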

 

What I Actually Do

I run multiple models. I use Claude as my primary. I reach for DeepSeek when I’m writing image prompts. I use Perplexity when I need research that won’t gaslight me. I’ve started giving z.ai more attention.

This isn’t about loyalty. It’s about getting the job done properly.

The AI landscape moves fast. These ratings reflect my experience as of early 2026. Some of these models are updating constantly — today’s 4/19 could be next month’s 14/19. But right now, today, this is what I found.

Make of it what you will.

 

Honourable mention:

Mistral. I didn’t include it in the formal ratings because the free tier hammers you so hard you barely get to know the model. What I did see puts it in Perplexity territory — genuinely capable, worth watching. If they ever loosen the free tier reins, it deserves a proper score.
