The Infra Change Lifecycle
- Overview: the full lifecycle
- 1. Understand the change
- 2. Find what to change
- 3. Branch and edit
- 4. Lint and validate locally
- 5. Dry-run against production inventory
- 6. Open the merge request
- 7. Pipeline runs automated checks
- 8. Peer review
- 9. Merge and deploy
- 10. Verify in production
- Rollback
- Shortcuts and when to use them
Overview: the full lifecycle
Ticket → Understand → Find file → Branch → Edit → Lint → Dry-run → MR → CI → Review → Merge → Deploy → Verify
Each step catches a different class of problem. Skipping steps is how production outages happen.
1. Understand the change
Before touching any code, make sure you understand:
- What is the desired end state? (not "what should I change" but "what should the system look like after")
- Which hosts are affected? (one? a group? all?)
- Is this change reversible? What does rollback look like?
- Are there dependencies? (does this require a firewall change? a cert? a FreeIPA rule?)
- Is there a maintenance window? Can this be deployed during business hours?
If the ticket is unclear, ask before starting. A 10-minute conversation is faster than a 2-hour incident.
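One way to answer the "which hosts are affected" question before writing anything is to ask Ansible what a pattern resolves to. The helper below is a minimal sketch, assuming the production inventory path used elsewhere in this guide; the function name affected_hosts and the ANSIBLE_BIN override are hypothetical and exist only so the sketch can be tried without a real inventory.

```shell
# Assumption: ANSIBLE_BIN is a hypothetical override; by default the real
# ansible CLI is called.
ANSIBLE_BIN=${ANSIBLE_BIN:-ansible}

# affected_hosts PATTERN [INVENTORY]
# Prints the hosts a pattern (a group, a host name, or "all") resolves to.
# --list-hosts only reads the inventory; it does not connect to any host.
affected_hosts() {
    local pattern=$1
    local inventory=${2:-inventories/production/hosts.ini}
    "$ANSIBLE_BIN" "$pattern" -i "$inventory" --list-hosts
}

# Usage: affected_hosts webservers
```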
2. Find what to change
# Search for the variable or config option
grep -r "setting_name" inventories/ roles/
# Find which template generates the config
grep -r "setting_name" roles/*/templates/
# Check the role defaults to understand what is configurable
cat roles/nginx/defaults/main.yml
Common outcomes:
- Change a value — edit the right group_vars or host_vars file
- Add a new variable — add it to group_vars, make sure the role's template uses it
- Change role behaviour — edit the task or template file in the role
- New service on a host — add the role to site.yml or a playbook, add host to the group
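The searches above can be combined into one helper that reports the file and line of every occurrence, which makes it easier to tell role defaults, group_vars, host_vars and templates apart. This is a sketch under the assumption that inventories/ and roles/ sit at the repository root; find_setting is a hypothetical name.

```shell
# find_setting NAME [ROOT]
# Lists file:line:match for every occurrence of NAME under inventories/ and
# roles/. ROOT defaults to the current directory (assumed to be the repo root).
find_setting() {
    local name=$1
    local root=${2:-.}
    # -r recurse, -n show line numbers, -F treat NAME as a literal string
    grep -rnF -- "$name" "$root/inventories" "$root/roles" 2>/dev/null | sort
}

# Usage: find_setting client_max_body_size
```

If the same name shows up in more than one group_vars or host_vars file, check which one applies to the hosts in question before editing.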
3. Branch and edit
git checkout main
git pull origin main
git checkout -b feature/INF-1234-description-of-change
# Make your changes
# Use $EDITOR or your preferred tool
# Stage only what you intend to change
git add inventories/production/group_vars/webservers.yml
# Review the diff
git diff --staged
# Commit with a meaningful message
git commit -m "Update nginx client_max_body_size for webservers

Ticket: INF-1234
Increasing from 1m to 64m to support large file uploads.
Applies to all hosts in the webservers group."
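If branch names are typed by hand they tend to drift from the feature/TICKET-description pattern shown above. A small helper can build them consistently; this is only a sketch, and branch_name is a hypothetical function, not part of any existing tooling.

```shell
# branch_name TICKET DESCRIPTION...
# Builds a branch name in the feature/<TICKET>-<slug> form used above.
branch_name() {
    local ticket=$1
    shift
    local slug
    # Lowercase the description, turn each run of non-alphanumeric characters
    # into a single hyphen, then trim hyphens from both ends.
    slug=$(printf '%s' "$*" \
        | tr '[:upper:]' '[:lower:]' \
        | sed -e 's/[^a-z0-9][^a-z0-9]*/-/g' -e 's/^-//' -e 's/-$//')
    printf 'feature/%s-%s\n' "$ticket" "$slug"
}

# Usage: git checkout -b "$(branch_name INF-1234 "Raise nginx body size")"
```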
4. Lint and validate locally
# YAML syntax check
yamllint inventories/production/group_vars/webservers.yml
# Ansible lint — checks for best practice violations
ansible-lint
# Syntax check the playbook
ansible-playbook site.yml --syntax-check -i inventories/production/hosts.ini
Fix any errors before proceeding. Lint failures in CI will block your MR anyway — catch them now.
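The three checks above can be chained so that the first failure stops the run, which mirrors how a CI pipeline would block the MR. The following is a minimal sketch; run_step and lint_all are hypothetical helper names, and the file paths are the ones used in this section.

```shell
# run_step LABEL COMMAND [ARGS...]
# Runs one check, prints PASS or FAIL, and returns the command's exit status.
run_step() {
    local label=$1
    shift
    if "$@"; then
        printf 'PASS %s\n' "$label"
    else
        local rc=$?
        printf 'FAIL %s (exit %d)\n' "$label" "$rc"
        return "$rc"
    fi
}

# lint_all
# Runs the checks from this section in order and stops at the first failure.
lint_all() {
    run_step yamllint yamllint inventories/production/group_vars/webservers.yml &&
    run_step ansible-lint ansible-lint &&
    run_step syntax-check ansible-playbook site.yml --syntax-check \
        -i inventories/production/hosts.ini
}

# Usage: lint_all && echo "ready to dry-run"
```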
5. Dry-run against production inventory
# Full dry-run with diff — see exactly what would change
ansible-playbook site.yml \
--check --diff \
-i inventories/production/hosts.ini \
--limit webservers # only run against webservers group
# Narrow further to a single host if possible
ansible-playbook site.yml \
--check --diff \
-i inventories/production/hosts.ini \
--limit web01.example.com
Review the diff carefully:
- Do you see only the files you expected to change?
- Does the diff show the right before/after values?
- Are any unexpected files changing? (indicates a variable collision or wider impact)
- Are any hosts changing that should not be? (check your --limit)
Paste the --check --diff output into the MR description or attach it to the ticket.
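To answer "are any hosts changing that should not be?" without scrolling through the whole run, the dry-run output can be saved to a file and its PLAY RECAP summarised. This sketch assumes the default recap layout (host, a colon, then key=value counters); changed_hosts and the dry-run.txt file name are hypothetical.

```shell
# changed_hosts FILE
# Prints each host in the PLAY RECAP of a saved run whose changed= counter is
# greater than zero.
changed_hosts() {
    awk '
        /^PLAY RECAP/ { in_recap = 1; next }
        in_recap && $2 == ":" {
            for (i = 3; i <= NF; i++) {
                if ($i ~ /^changed=/) {
                    split($i, kv, "=")
                    if (kv[2] + 0 > 0) print $1
                }
            }
        }
    ' "$1"
}

# Usage:
#   ansible-playbook site.yml --check --diff \
#       -i inventories/production/hosts.ini --limit webservers | tee dry-run.txt
#   changed_hosts dry-run.txt
```

The saved file doubles as the attachment for the MR or ticket.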
6. Open the merge request
git push -u origin feature/INF-1234-description-of-change
Open the MR in GitLab. Use a description template like:
## What
[What is being changed]
## Why
Ticket: INF-1234 — [Brief description from ticket]
## Hosts affected
[list of hosts or groups]
## Dry-run output
[paste --check --diff output or attach file]
## Rollback
[How to revert: revert this MR commit, or manual steps]
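The template can also be filled in from the command line so that the dry-run output is never retyped. This is a sketch only: mr_description is a hypothetical helper, and the default rollback text is simply the first option suggested in the template.

```shell
# mr_description TICKET SUMMARY HOSTS DRYRUN_FILE [ROLLBACK]
# Prints an MR description that follows the template above, embedding the
# saved --check --diff output from DRYRUN_FILE.
mr_description() {
    local ticket=$1 summary=$2 hosts=$3 dryrun_file=$4
    local rollback="${5:-Revert this MR commit.}"
    printf '## What\n%s\n\n' "$summary"
    printf '## Why\nTicket: %s\n\n' "$ticket"
    printf '## Hosts affected\n%s\n\n' "$hosts"
    printf '## Dry-run output\n'
    cat -- "$dryrun_file"
    printf '\n## Rollback\n%s\n' "$rollback"
}

# Usage: mr_description INF-1234 "Raise client_max_body_size to 64m" \
#            webservers dry-run.txt > mr-description.md
```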
7. Pipeline runs automated checks
When you push and open an MR, the CI pipeline runs automatically. Typical jobs:
- ansible-lint — must pass before anything else runs
- syntax-check — verifies all playbooks parse correctly
- dry-run (optional) — runs --check --diff against the production or staging inventory; output saved as an artifact
If the pipeline fails: click the failed job in GitLab, read the log, fix the issue, push again. The pipeline re-runs automatically.
8. Peer review
Assign a reviewer who knows the relevant system. What a good reviewer looks at:
- Is the change in the right file? (group_vars vs host_vars, right group name)
- Does the value make sense? Is it correct syntax/format for what the role expects?
- Are there unintended side effects? (variable used elsewhere, other roles referencing it)
- Is the dry-run output as expected?
- Is there a documented rollback plan for high-risk changes?
Respond to comments, push updates, and mark conversations resolved when addressed.
9. Merge and deploy
Once the MR is approved and the pipeline is green, merge it.
Depending on your pipeline setup:
- Auto-deploy on merge — pipeline on main automatically runs the playbook. Watch the pipeline.
- Manual deploy job — click the play button on the deploy job in the pipeline. Review the job log as it runs.
- No CI deploy — clone main and run the playbook manually with the correct inventory and limit.
# Manual deploy (if no CI automation)
git checkout main
git pull origin main
ansible-playbook site.yml \
-i inventories/production/hosts.ini \
--limit webservers \
--tags nginx
# Run everything EXCEPT a specific tag (e.g. skip a long data migration)
ansible-playbook site.yml \
-i inventories/production/hosts.ini \
--skip-tags migrate-db
Note on check mode: by default, command and shell tasks are skipped in check mode, so later tasks that depend on their registered output may fail or report misleading results. template and copy report their changes reliably in check mode; command and shell do not. Always treat --check output as a guide, not a guarantee.
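To see how much of a role the dry-run cannot vouch for, it can help to list the tasks that call command or shell. The helper below is a rough sketch, assuming roles live under roles/ and tasks are written one module key per line; unchecked_tasks is a hypothetical name, and the pattern will not catch every possible YAML layout.

```shell
# unchecked_tasks [ROLES_DIR]
# Lists file:line:text for task lines that call command or shell, using either
# the short or the ansible.builtin. module name.
unchecked_tasks() {
    local roles_dir=${1:-roles}
    find "$roles_dir" -type f \( -name '*.yml' -o -name '*.yaml' \) \
        -exec grep -nHE \
        '^[[:space:]]*(- )?(ansible\.builtin\.)?(command|shell):' {} + \
        | sort
}

# Usage: unchecked_tasks roles/nginx
```

Any task listed here deserves a manual read before trusting a clean --check run.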
10. Verify in production
After deployment, confirm the change took effect and the service is healthy:
# Check the service is running
systemctl status nginx
# Check the config file has the expected content
# nginx.conf often includes conf.d/ — search there too
grep -r client_max_body_size /etc/nginx/
# Validate the live config
nginx -t
# Test the service responds
curl -v http://app.example.com/health
# Check logs for errors since the deployment
journalctl -u nginx --since "5 minutes ago"
Only close the ticket once you have confirmed the change works as expected.
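The grep check above can be turned into a pass/fail test that also catches a stale value left behind in an included file. This is a sketch that assumes one directive per line in the usual "name value;" form; directive_values and verify_directive are hypothetical helpers.

```shell
# directive_values NAME [CONF_DIR]
# Prints the distinct values set for an nginx directive anywhere under
# CONF_DIR. Commented-out lines are ignored.
directive_values() {
    local name=$1
    local conf_dir=${2:-/etc/nginx}
    grep -rhE "^[[:space:]]*${name}[[:space:]]+" "$conf_dir" \
        | awk '{ sub(/;.*$/, "", $2); print $2 }' \
        | sort -u
}

# verify_directive NAME EXPECTED [CONF_DIR]
# Succeeds only if NAME is set and every occurrence has the EXPECTED value.
verify_directive() {
    [ "$(directive_values "$1" "${3:-/etc/nginx}")" = "$2" ]
}

# Usage: verify_directive client_max_body_size 64m && echo "config matches"
```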
Rollback
Something went wrong after merge. Act quickly:
Option 1: Revert the MR commit
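# Start from an up-to-date main on a dedicated revert branch
git checkout main && git pull origin main
git checkout -b revert/fix-bad-change
# Find the merge commit
git log --oneline -5
# Revert it (-m 1 keeps main's side of the merge as the baseline;
# drop -m 1 if the MR was squash-merged into a single commit)
git revert -m 1 COMMIT_HASH
# Push and open an MR for the revert
git push -u origin revert/fix-bad-change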
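Under pressure it is easy to pick the wrong hash out of the log. The lookup can be narrowed to merge commits on main's own history; this is a sketch that assumes MRs are merged with a merge commit (not squashed or fast-forwarded), and last_merge is a hypothetical helper name.

```shell
# last_merge [BRANCH]
# Prints the hash of the most recent merge commit on BRANCH's first-parent
# history, which is the commit that git revert -m 1 expects.
last_merge() {
    git log --merges --first-parent -n 1 --format=%H "${1:-main}"
}

# Usage: git revert -m 1 "$(last_merge main)"
```

Check the commit subject with git show before reverting, in case another MR was merged after the faulty one.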
Option 2: Quick emergency fix without waiting for review
git checkout main && git pull
git checkout -b hotfix/urgent-rollback
# Edit the file back to the previous value manually
# Stage only tracked files you modified, not everything in the working tree
git add -u && git commit -m "hotfix: revert bad setting causing nginx errors"
git push -u origin hotfix/urgent-rollback
# Open MR (mark as urgent), deploy, then close MR after the fact
Shortcuts and when to use them
In real-world situations, shortcuts exist. Use them consciously, not by default:
- Skip the MR for a hotfix — apply directly and open a post-hoc MR for the record. Only for genuine production emergencies where minutes matter.
- Skip the dry-run — acceptable for trivially safe changes (adding a new comment, changing a log level). Never skip for config changes that affect running services.
- Use -e for one-time override — fine for testing a value. Always follow up with a proper change in the repo.
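As an illustration of the -e shortcut, a one-off value can be tried in check mode against a single host before it is committed to the repo. Extra vars passed with -e take precedence over inventory and role variables. This is a sketch: try_override, the ANSIBLE_PLAYBOOK_BIN override and the variable name nginx_client_max_body_size are all hypothetical, so substitute the variable your role actually reads.

```shell
# Assumption: ANSIBLE_PLAYBOOK_BIN is a hypothetical override; by default the
# real ansible-playbook CLI is called.
ANSIBLE_PLAYBOOK_BIN=${ANSIBLE_PLAYBOOK_BIN:-ansible-playbook}

# try_override HOST VAR=VALUE
# Dry-runs site.yml on one host with a one-off extra var. --check is always
# passed, so this helper never applies the change.
try_override() {
    local host=$1 override=$2
    "$ANSIBLE_PLAYBOOK_BIN" site.yml --check --diff \
        -i inventories/production/hosts.ini \
        --limit "$host" \
        -e "$override"
}

# Usage: try_override web01.example.com nginx_client_max_body_size=64m
```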