Experienced and data-driven Site Reliability Engineering leader with a track record of building cross-functional, geo-distributed DevOps teams. Skilled at defining and implementing SRE best practices and driving innovation, with a strong focus on uptime and customer experience.
My strong mix of development and operations skills, deep experience with AWS, and understanding of the interplay between software development and operations have enabled me to build and lead high-performing teams that drive business results.
Helped lead the SRE team through the path to IPO and oversaw the completion of governance processes relevant to my teams
Interviewed M&A targets and helped develop the due diligence processes for potential acquisitions with a focus on system architecture, system availability, technology stacks / depreciation, and operational readiness
Helped define divisional OKRs, and assisted in measuring the completion of R&D division goals on a quarterly basis
Developed cross-functional SRE mentoring programs to improve internal candidate pipelines, employee engagement, and morale, and to reduce attrition
Managed M&A integration from a deployment, system development, security and vulnerability management program standpoint
Presented to prospects and contributed to RFPs on the subjects of system availability, security defense in depth, vulnerability management, and cloud native technology adoption
Oversaw the execution and deliverables of the network, system, release, and container platform engineering teams
Mentored and grew managers, along with individual contributors from a large cross-section of the business
Democratized operations and change management. Moved from constant firefighting to standard, repeatable, prompt processes, with no downtime for patching, release, and administrative tasks. Developed and published OLAs to reduce friction and measure the success of process-driven Ops
Built out a Kubernetes-focused team and partnered with development to launch our first microservices on EKS. Worked with leaders in Product and Development to build a comprehensive roadmap to Kubernetes in order to minimize infrastructure spend and release complexity, while maximizing uptime
Participated in the company's security and compliance steering committee, focused on providing world-class protection for our systems and customer data
Developed the Release and System Engineering teams from the ground up and matured Operations and Network Engineering. Built teams responsible for core system operations, automated infrastructure provisioning, application deployment, and incident response
Successfully led the migration of Alkami's privately hosted customer and corporate environments to AWS
Oversaw the maintenance of remaining corporate hardware and the implementation of enterprise vulnerability management programs, and acted as the product owner for my teams
Ran the production certification for PCI/SOC 2 Type 2/SOX assessments, and led the technical response for gap item resolution
Created repeatable incident response and retrospective processes and automation used for all severity 1 and 2 incidents. Reduced MTTR and improved customer satisfaction and data quality through the development of a custom Slack chatbot focused on incident response and client communications.
Implemented a robust monitoring program using New Relic and Elasticsearch. Championed the use of Infrastructure as Code. Transformed the team from point-and-click system builders to developers who produced highly automated, repeatable, and thoroughly tested infrastructure
Modernized the release and deploy process by implementing identical infrastructure and deployment automation in all environments. Built self-service tooling to allow developers to deploy code and infrastructure easily, without direct access to systems. Scaled deployments to thousands of application releases per month
Championed best practices with the development and product organizations. Helped lead cross-functional incident reviews to drive action items and build strong data around areas of concern. Worked with product to use this data to drive focused technical debt paydown and rearchitecture, with quantifiable ROI
Responsible for the availability, scalability, configuration, deployment, and monitoring of our online banking platform, serving millions of contracted users.
Built and enhanced CI, deployment, and maintenance workflows in Jenkins and TeamCity
Designed and implemented an ELK Stack cluster for centralized logging, capable of handling the 10-25k events per second emitted from Dev, QA, Staging, and Production
Architected and wrote PowerShell modules and the fleet-wide deployment process used for all areas of application operations, deployment, and configuration management
Wrote custom web and Windows services for monitoring and operations, and to integrate Jira, New Relic, and DynDNS/Cloudflare data into our applications and HipChat/Slack
Created infrastructure automation using Bash, PowerShell, Terraform, and Packer. Reduced new environment spin-up time from weeks to hours
Created automated testing tools (custom HttpClient, NUnit, and Selenium) to ensure quality releases and effective monitoring, and to power rollback decisions
Worked directly with architecture and product engineering to shape and improve microservices strategy, and codify SRE requirements and NFRs