“How’s the Copilot rollout going?” your CTO asks during your quarterly review.
“Usage is steady,” you reply. “We’re hovering around 80% of developers using it weekly.”
“That’s good, right?” she asks.
You pause. The number looks fine. But you know things aren’t going well: the quality of AI-assisted PRs has been declining. More AI slop making it through review. Developers accepting first outputs without iteration. The same prompts being reused without understanding.
“The usage number is fine,” you say carefully. “But I’m more concerned about how they’re using it.”
If you’re leading AI tool adoption for developers (whether it’s GitHub Copilot, Cursor, or any other LLM-powered coding assistant), you’ve probably discovered this: Getting developers to try the tool is the easy part. Getting them to use it well, consistently, over time? That’s the real challenge.
After leading GitHub Copilot adoption for hundreds of engineers, here’s what I learned: the work doesn’t end at rollout. It begins there.
Scope note: This post is about AI coding assistants for developers: GitHub Copilot, Cursor, Windsurf, and similar tools. The focus is code generation, code review, and developer workflows. Some principles apply more broadly, but the challenges here are specific to engineering teams.
The Pattern You’ve Probably Seen: Progressive Decline
Let me describe what typically happens with AI coding tool adoption:
Weeks 1-4: The Exploration Phase
Developers try GitHub Copilot (or Cursor, or whatever tool you’ve chosen). Some early wins: autocomplete and agents save time, boilerplate generation works well. Excitement in Slack. Demos in team meetings.
Weeks 5-12: The Settling Phase
Usage patterns emerge. Some developers integrate it deeply into their workflow. Others use it occasionally. Some abandon it. This is normal. Tools fit different workflows differently.
But here’s where things start to drift:
Months 3-6: The Drift
Developers start accepting AI’s first output without iteration. Large AI-generated PRs appear with minimal context. Prompts get lazy (“make this better,” “fix this”). AI slop creeps into the codebase. Code review quality declines because reviewers assume AI code is good.
Month 6+: The Quiet Decline
Usage numbers might look stable. But quality has degraded, MTTR has climbed, and error rates are markedly higher. The tool has become a crutch rather than an amplifier.

Where Most Adoption Strategies Go Wrong
Most organizations approach this like a product launch:
- Choose a tool
- Get licenses
- Roll out access
- Run training
- Track usage
- Declare success when numbers look good
The problem: this treats adoption as an event, not a capability.
Event thinking says: “80% of developers used Copilot this month!” Capability thinking asks: “Are those developers generating code they understand? Are they iterating on prompts or accepting first outputs? Is quality holding?”
Event thinking optimizes for launch metrics. Capability thinking builds infrastructure for sustained quality.
We learned this the hard way. Our launch went fine. The real work was everything that came after.
What We Did
We chose GitHub Copilot as our primary tool. (Your organization might choose something else. The principles still apply.)
Tool infrastructure:
- Copilot licenses for all developers
- Access to Claude and GPT-4 for architecture discussions and complex problem-solving
- MCP servers connecting AI to our internal context: JIRA for tickets, Confluence for architecture decisions, internal API docs, coding standards
Quality infrastructure:
- CI/CD gates to flag AI-generated code patterns
- Code review guidelines for AI-assisted PRs
- Automated checks for common AI slop patterns
Ongoing support:
- Weekly prompting workshops (not tool training; prompting is the skill)
- Office hours for questions
- Prompt library with real examples from our codebase
- Bi-weekly showcases: “Here’s how I used AI this sprint”
- Team champions with dedicated time allocation
Measurement:
We made a pragmatic choice: measure adoption and code quality, not productivity.
Why? Building a proper DORA metrics dashboard with AI attribution would’ve taken 6 months. We didn’t have that infrastructure. We didn’t have the team to maintain it.
So we tracked: weekly active users by tool, adoption by team, PR lead time as a rough proxy, code review feedback patterns, and self-reported sentiment. Not perfect. Better than waiting 6 months for perfect dashboards while practices eroded.
The Problems We Faced
Quality Eroded Around Month 3
Early on, metrics looked good. 70-80% usage. But around month 3, we noticed a pattern in code reviews.
Developers were accepting AI’s first output without iteration.
They’d describe a problem to Copilot, get code back, glance at it, submit a PR. No refinement. No questioning. No iteration.
The result: code that technically worked but was overly complex, used patterns inconsistent with our codebase, lacked proper error handling, and had subtle bugs that surfaced days later.
Initial adopters were thoughtful. They iterated on prompts. They reviewed AI output critically. They tested thoroughly. But as adoption spread, those practices didn’t transfer. New users saw the tool as magic. “AI wrote it, so it must be good.”
The problem wasn’t usage declining. Quality declined while usage stayed steady.
Code Review Broke in Ways We Didn’t Expect
AI-generated code breaks traditional code review.
We started seeing PRs like this: 400+ lines of AI-generated code. Minimal context: “Used Copilot to implement feature X.” No explanation of what was generated versus what was modified. Submitter had watched AI write it, tweaked a few lines, clicked “Create PR.”
Reviewers were stuck. How do you review code you didn’t see being written? How do you know if the submitter even understands it?
Some reviewers spent 30+ minutes trying to parse what AI had generated. Others focused only on obvious bugs, missing design issues. Quality became inconsistent across teams.
We needed new code review practices. Training reviewers took time. Enforcement was inconsistent.

Prompting Is a Skill (Most Engineers Are Bad At It)
We assumed prompting was intuitive. “Just tell the AI what you want.”
Wrong. Effective prompting is a skill. We had excellent engineers who were terrible prompters.
Bad prompts we saw constantly:
- “Make this code better” (no criteria)
- “Fix this bug” (no context)
- 500 lines dumped with “optimize this” (context overload)
- Accepting first output without iterating (one-shot mentality)
The result: frustration. “Copilot doesn’t work for me” usually meant “I don’t know how to prompt effectively.”
Tool training isn’t enough. You need continuous prompting skill development. We underestimated this badly.
The Measurement Problem
Every exec wanted to know: “Are we more productive?”
I didn’t have a good answer.
Building comprehensive productivity measurement would have required 6+ months of data engineering, dashboard infrastructure we didn’t have, instrumentation across multiple tools, and a team to maintain it.
We didn’t have those resources. So we focused on adoption and quality signals instead of productivity metrics.
Our reasoning: if developers aren’t using the tool, productivity doesn’t matter. If they’re using it badly, productivity gains are illusory. If they’re using it well, improvements will show over time. PR lead time gives us a rough proxy without perfect measurement.
Not ideal. It worked.
The Sustainability Problem
This was the most subtle issue.
Maintaining good practices over time is hard when there’s no forcing function.
If you can submit AI-generated code without explaining it, some developers will. If you can skip testing AI code and it passes review, some developers will. If you can use lazy prompts and get acceptable output, some developers will.
The path of least resistance was: use AI carelessly.
We needed to make the path of least resistance: use AI well.
What Actually Worked
After seeing quality erode despite stable usage, we shifted our approach. The goal wasn’t driving usage up. It was building sustainable, high-quality practices.
CI/CD Gates
We added automated gates to our pipeline:
- Flag PRs with >30% AI-generated code for extra review
- Reject overly complex AI-generated functions
- Require tests for AI code (not just “it compiled”)
- Require documentation: if AI wrote it, you explain it
This wasn’t popular initially. One team lead called it “bureaucratic nonsense” in Slack. He came around eventually.
But it did two things: forced a quality bar (can’t merge sloppy AI code) and made AI usage visible (reviewers knew what to look for).
Within a month, PR quality improved. Developers became more intentional about what they asked AI to generate.
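As an illustration, the gate logic can be sketched like this. Everything here is a stand-in: how you attribute lines to AI (commit trailers, IDE telemetry, self-reporting) varies by toolchain, and `PRStats` is hypothetical metadata, not a real API.

```python
from dataclasses import dataclass

# Hypothetical PR summary. A real pipeline would populate these fields from
# commit trailers, IDE telemetry, or reviewer-supplied PR metadata.
@dataclass
class PRStats:
    total_lines: int
    ai_generated_lines: int
    has_tests: bool
    has_generation_notes: bool  # "here's what AI wrote, here's what I changed"

AI_SHARE_THRESHOLD = 0.30  # our gate: >30% AI-generated triggers extra review

def gate_checks(pr: PRStats) -> list[str]:
    """Return gate flags for a PR; an empty list means it passes cleanly."""
    flags = []
    ai_share = pr.ai_generated_lines / pr.total_lines if pr.total_lines else 0.0
    if ai_share > AI_SHARE_THRESHOLD:
        flags.append(f"extra-review: {ai_share:.0%} AI-generated exceeds 30%")
    if pr.ai_generated_lines > 0 and not pr.has_tests:
        flags.append("blocked: AI-generated code requires tests")
    if pr.ai_generated_lines > 0 and not pr.has_generation_notes:
        flags.append("blocked: explain what was generated vs. modified")
    return flags
```

The design choice that mattered: gates return explicit reasons, not a pass/fail bit, so developers see exactly which quality bar they missed.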

Context Integration (MCP Servers)
One reason developers got poor results: generic AI doesn’t know your codebase.
We built MCP servers that gave AI access to our JIRA tickets, Confluence docs, internal API documentation, and code style guides.
Suddenly, AI could give context-aware answers: “Looking at JIRA-1234, here’s how to implement this matching our patterns…” or “Based on our error handling doc, you should…”
This made AI more useful. Useful tools get used.
The catch: building MCP servers takes engineering time. We had to prioritize it. If I did this again, I’d build context integration before launch, not after.
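To make the idea concrete, here is a minimal sketch of what one MCP tool handler does conceptually: resolve internal identifiers into context the model can use. The dictionaries are stand-ins for real JIRA and Confluence API calls, and every name below is illustrative.

```python
# Stand-in data stores. In a real MCP server, these lookups would be
# authenticated calls to JIRA, Confluence, and internal docs.
TICKETS = {"JIRA-1234": "Add rate limiting to the public API"}
STANDARDS = {"error-handling": "Wrap external calls; raise a domain error with context."}

def build_context(ticket_id: str, topic: str) -> str:
    """Assemble the internal-context block attached to an AI prompt."""
    parts = []
    if ticket_id in TICKETS:
        parts.append(f"Ticket {ticket_id}: {TICKETS[ticket_id]}")
    if topic in STANDARDS:
        parts.append(f"Standard [{topic}]: {STANDARDS[topic]}")
    return "\n".join(parts) if parts else "No internal context found."
```

The value isn’t in this trivial lookup; it’s that the assistant answers with your ticket and your standards in front of it instead of generic training data.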
Champion Support That Actually Works
Our initial champion program: give people a badge, ask them to evangelize, hope they stay motivated.
That didn’t work. Champions burned out. I should have seen it coming.
What worked:
- Dedicated time for champions: 20% allocation for the program
- Monthly sync: champions met, shared problems, supported each other
- Budget for experiments: try new tools without approval cycles
- Career incentive: AI expertise counted toward promotion
- Rotation: new champions every 3 months to prevent burnout
Lightweight Metrics
We gave up on perfect productivity measurement.
Simple dashboard showing: weekly active users by tool, adoption by team (color-coded), PR lead time trend (6-week rolling average), and self-reported “AI made me faster this week” sentiment.
Updated every Monday. Reviewed in leadership meeting. No individual tracking.
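The 6-week rolling average is deliberately simple to compute; a sketch (the numbers in the test are made up, and weekly lead times would come from your Git hosting API):

```python
def rolling_average(values: list[float], window: int = 6) -> list[float]:
    """Trailing mean over up to `window` points; early weeks use what exists."""
    out = []
    for i in range(len(values)):
        chunk = values[max(0, i - window + 1): i + 1]
        out.append(sum(chunk) / len(chunk))
    return out
```

A rolling average was a conscious trade-off: it hides week-to-week noise, which is exactly what you want when the question is “is lead time drifting?” rather than “what happened this week?”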
Sustained Learning
Initial training: 2-hour session on “How AI Tools Work.”
Ongoing learning:
- Weekly “AI Office Hours”: bring your actual problems, get help
- Bi-weekly “Show & Tell”: someone demos a real use case, 15 minutes
- Prompt Library: curated, maintained, actually useful
- Monthly “AI Lunch”: casual discussion, no slides
Attendance dropped from 60 (week 1) to 15-20 (steady state). But those 15-20 kept coming. They became advocates.
Addressing Lazy Usage Directly
Let’s be honest: some developers were using AI to avoid thinking. Copy AI output. Submit PR. Hope nobody notices.
We addressed this culturally, not just with tools.
- Made it explicit: “AI is a tool, not autopilot.”
- Added an “Understanding Check” to code review: “Explain this code to me.”
- Celebrated iteration stories: “I used AI for 5 drafts before getting it right.”
- Made laziness visible: PRs without test coverage got rejected.
Breaking Down Work at Multiple Levels
One of the biggest problems: large AI-generated PRs. Developers would ask AI to implement entire features. 400+ line PRs. Impossible to review properly.
The solution was breaking down work at two levels.
At the PM level: write user stories that are small and focused. Each story should be implementable in 200 lines or less. Acceptance criteria specific enough that AI doesn’t over-engineer.
At the developer level: break implementation into smaller chunks. Use AI for one piece at a time, not the entire feature. Each PR should have a single, clear purpose.
When PMs write tight stories, developers naturally create smaller PRs. This required changing our planning process, not just our coding practices.
What I’d Do Differently
Start with constraints, not capabilities. We started by showing what AI could do. Better: start with the guardrails. “Here’s how to use AI safely. Here’s what good AI-assisted review looks like. Here’s the quality bar.” Constraints first, creativity second.
Celebrate month 6, not week 1. Early metrics are meaningless. Month 6 tells you if it stuck.
Rotate champions. Don’t let your best advocates burn out.
Build context integration early. Generic AI is cool for demos. Context-aware AI is useful for work. Invest in MCP servers before launch.
Address lazy usage as a culture problem. Tools won’t fix lazy developers. Culture will. Understanding your code is non-negotiable, whether you wrote it or AI did.
What Success Actually Looks Like
Good signs:
- Developers iterate on prompts (not accepting first output)
- AI-assisted PRs include context (“Here’s what AI generated, here’s what I modified”)
- Code review discussions focus on design, not just syntax
- Quality metrics stay stable or improve
Warning signs:
- PR size increasing without explanation
- Code review becoming rubber-stamp
- Developers can’t explain code they submitted
- Bug rates increasing in AI-assisted code
- Prompts getting lazier
The metric that matters most isn’t “% of developers using AI.” It’s quality of AI-generated code over time.
If usage is high but quality is declining, that’s not success. That’s a problem.
Final Thoughts
If you’re rolling out AI coding tools in the coming months, remember this: high usage numbers feel like success. They might not be.
What matters isn’t that developers are using AI. It’s that they’re using it to generate code they understand. Not slop.
That requires continuous investment in prompting skills, guardrails, culture, and feedback loops.
This isn’t a launch. It’s a practice.
Monitor for quality erosion, not just usage decline. Your dashboard might show green while your codebase fills with AI slop. By the time you notice, you’ll have months of technical debt to clean up.