About the Role
Tata Consultancy Services (TCS) is an IT services, consulting and business solutions organization that has been partnering with many of the world’s largest businesses in their transformation journeys for over 50 years. TCS offers a consulting-led, cognitive powered, integrated portfolio of business, technology and engineering services and solutions. This is delivered through its unique Location Independent Agile™ delivery model, recognized as a benchmark of excellence in software development.
A part of the Tata group, India’s largest multinational business group, TCS has over 616,171 of the world’s best-trained consultants, representing 157 nationalities, in 53 countries. For more information, visit www.tcs.com and follow TCS news at @TCS_News.
Provide day-to-day monitoring and operational support of Gernas, its dependent AI applications, deployed AI agents, agent runtime environments, AI gateways, and supporting Azure and AWS cloud services. Ensure continuous availability and consistent service performance across production environments, proactively identifying degradation signals before they become user-visible incidents. Maintain operational dashboards covering platform health, agent execution, LLM consumption, gateway throughput, and integration endpoints, and ensure that ownership of every production component is unambiguous and documented.
Incident and Problem Management
Key Responsibilities
- Lead L2 and L3 incident response across the Gernas platform and its surrounding AI products. Conduct structured root-cause analysis, drive service restoration within agreed P1–P4 service-level commitments, and coordinate major incidents involving multiple engineering, cloud, security, and vendor stakeholders. Maintain problem records for recurring or systemic issues, ensure that post-incident reviews are completed with clear corrective and preventive actions, and own the closure of those actions through to engineering remediation. Act as the bridge between the L1 AI Operations Centre and L3 engineering teams during complex incidents.
- Troubleshoot the behaviour of single-agent and multi-agent workflows, including agent orchestration, tool invocation, skill execution, prompt construction, memory and context handling, planning loops, and inter-agent coordination. Diagnose failures across Microsoft Agent Framework and LangGraph execution, including state-graph traversal, conditional routing, tool-call mismatches, retries, and human-in-the-loop checkpoints. Investigate integrations with enterprise systems, MCP-exposed tools, and downstream APIs, and work with engineering teams to harden agent designs where production data reveals weaknesses.
- Support the operational stability of the model-serving layer across Azure OpenAI PTU deployments, AWS Bedrock, and Core42 Compass.