PUNEETH
KOTHA

Software Engineer

Machine Learning, Distributed Systems

01

SELECTED WORK

01

Flint: LLM Workflow Engine

Infrastructure · 2024

Problem: Building complex LLM workflows requires code changes for every adjustment; workflows need to be describable in natural language and compiled into executable form.

Approach: LLM-powered compiler translating natural-language descriptions into executable DAGs. Parallel asyncio with topological scheduling, schema-based corruption detection, and exponential backoff retry scheduling.
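The scheduling idea — run every node whose dependencies have completed, in parallel, until the DAG is drained — can be sketched in a few lines. `run_dag` and its task/dependency shapes are illustrative assumptions, not Flint's actual API:

```python
import asyncio

async def run_dag(tasks, deps):
    """Execute async tasks in parallel while respecting dependency edges.

    tasks: {name: coroutine factory}; deps: {name: set of prerequisite names}.
    (Hypothetical shapes for illustration.)
    """
    done, results, pending = set(), {}, dict(deps)
    while pending:
        # Nodes whose prerequisites are all complete can run concurrently.
        ready = [n for n, d in pending.items() if d <= done]
        if not ready:
            raise ValueError("cycle detected in workflow DAG")
        outs = await asyncio.gather(*(tasks[n]() for n in ready))
        for n, out in zip(ready, outs):
            results[n] = out
            done.add(n)
            del pending[n]
    return results
```

Each pass of the loop is one "level" of the topological order, so independent branches execute concurrently under asyncio.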

10K+ Exec/min
&lt;12ms p95 Latency
02

AuditAI

LLM Platform · 2025

Problem: Manual clinical note review is time-consuming and error-prone, requiring extraction of compliance fields across multiple rule categories.

Approach: LLM-powered auditing platform with async FastAPI backend, Redis deduplication, query-optimized PostgreSQL, and React dashboard with real-time risk scoring.
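The Redis deduplication step can be illustrated with an in-memory stand-in — `DedupCache` below mimics Redis's `SET key NX EX ttl` pattern; the class and its names are hypothetical, not AuditAI's code:

```python
import time

class DedupCache:
    """Idempotency-style dedup: remember request keys for `ttl` seconds
    and skip reprocessing duplicates within that window.

    A dict-based stand-in for Redis SET key NX EX ttl (illustrative only)."""

    def __init__(self, ttl=300.0, clock=time.monotonic):
        self.ttl = ttl
        self.clock = clock
        self.seen = {}  # key -> expiry timestamp

    def first_seen(self, key):
        now = self.clock()
        expiry = self.seen.get(key)
        if expiry is not None and expiry > now:
            return False  # duplicate within the TTL window: skip the audit
        self.seen[key] = now + self.ttl
        return True
```

Keying on a hash of the note content means a resubmitted clinical note returns the cached audit instead of re-running the LLM.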

-60% Review Time
&lt;500ms Audit Latency
03

Falcon: ML Inference Platform

MLOps · 2024

Problem: ML models deployed as single instances become bottlenecks. Need scalable, fault-tolerant inference with observability.

Approach: Multi-worker FastAPI inference service behind Nginx with Redis caching, idempotency-key deduplication, circuit breaker, exponential backoff retries, and graceful shutdown with in-flight request draining.
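A minimal sketch of the circuit-breaker piece — `CircuitBreaker`, its defaults, and the half-open trial logic are illustrative assumptions, not Falcon's implementation:

```python
import time

class CircuitBreaker:
    """Open after `threshold` consecutive failures, reject calls until
    `cooldown` seconds pass, then allow a single trial call (half-open).

    Illustrative sketch; names and defaults are assumptions."""

    def __init__(self, threshold=5, cooldown=30.0, clock=time.monotonic):
        self.threshold = threshold
        self.cooldown = cooldown
        self.clock = clock
        self.failures = 0
        self.opened_at = None

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.cooldown:
                raise RuntimeError("circuit open: request rejected")
            self.opened_at = None  # half-open: let one trial call through
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.threshold:
                self.opened_at = self.clock()  # trip the breaker
            raise
        self.failures = 0  # success resets the failure count
        return result
```

Fast-failing while the breaker is open keeps a struggling model worker from being hammered by retries.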

-30% Latency
50+ Concurrent Requests
04

Orbis ML Classifier

ML Research · NYU · 2025

Problem: Classifying 1.4M parent entities into Individual, Company, Family Firm, or Government across 120+ countries with inconsistent naming conventions and multilingual text.

Approach: Three-stage ML cascade: deterministic rules (~56% coverage), LLM API batch inference via Claude 3.5 Haiku, and fine-tuned XLM-RoBERTa with confidence-threshold abstention.

98.75% Precision
1.4M Entities
05

LOSight

Healthcare ML · 2024

Problem: Hospitals need to predict extended-stay risk at admission for proactive discharge planning and cost management.

Approach: Modeled extended-stay risk across 1.78M NYS inpatient discharges using SPARCS 2024 data. Benchmarked five classification models under class imbalance, applying SMOTE and threshold tuning.
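Threshold tuning under class imbalance can be sketched as a simple sweep — `tune_threshold` and its recall floor are illustrative, not the LOSight code:

```python
def tune_threshold(y_true, scores, min_recall=0.5):
    """Pick the decision threshold that maximizes precision while keeping
    recall at or above `min_recall` — a common adjustment when the positive
    class (extended stays) is rare and 0.5 is a poor default cutoff.

    Illustrative sketch; pure Python in place of sklearn utilities."""
    best = (0.5, 0.0)  # (threshold, precision)
    for t in sorted(set(scores)):
        pred = [s >= t for s in scores]
        tp = sum(p and y for p, y in zip(pred, y_true))
        fp = sum(p and not y for p, y in zip(pred, y_true))
        fn = sum((not p) and y for p, y in zip(pred, y_true))
        if tp == 0:
            continue
        precision = tp / (tp + fp)
        recall = tp / (tp + fn)
        if recall >= min_recall and precision > best[1]:
            best = (t, precision)
    return best
```

In practice the sweep runs on a held-out validation split, after any resampling (such as SMOTE) has been applied to the training data only.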

78% Precision
1.78M Discharges
06

ViT Optimization

Computer Vision · 2024

Problem: Vision Transformers are computationally expensive for production deployment, requiring significant latency reduction.

Approach: Combined 4-bit NF4 quantization with FlashAttention-2, achieving a 40.5% latency reduction and 449 img/sec throughput.

-40.5% Latency
449 img/sec Throughput
07

StockStream

Distributed Systems · 2024

Problem: Financial analytics require real-time processing with sub-second latency and fault tolerance under high-throughput loads.

Approach: Kafka and Spark streaming pipeline with hybrid PostgreSQL + InfluxDB storage optimized for time-series queries. Implemented consumer groups, checkpointing, and Grafana monitoring.

5K+ Events/sec
&lt;50ms Latency
02

EXPERIENCE

Software Engineer

New York University

Jan 2025 — Present

  • Architected three-stage ML cascade (deterministic rules, LLM batch inference, fine-tuned XLM-RoBERTa) classifying 1.4M entities across 120+ countries at 98.75% macro precision
  • Fine-tuned XLM-RoBERTa on 690K multilingual samples with confidence-threshold abstention, improving classification accuracy 53% over baseline and reducing misclassification rate to under 1.25%
  • Deployed async LLM inference pipeline via Anthropic Batches API processing 500K+ multilingual entities end-to-end, eliminating 6+ months of manual labeling
  • Built interactive web interface with metrics dashboard, workflow simulation, and per-country data visualization using Python, Chart.js, and static site generation
  • Engineered async batch processing system handling concurrent API requests with exponential backoff retry logic and rate limit management
PyTorch · XLM-RoBERTa · Python · Async I/O · Anthropic API · Transformers · pandas

Software Engineer Intern

1INME

Aug 2023 — May 2024

  • Designed and owned Node.js/TypeScript REST backend on AWS serving real-time bidirectional sync traffic, cutting synchronization latency 60% via API refactoring and composite index optimization across 3 core endpoints
  • Implemented OAuth2 authentication flows with secure token management and session handling, ensuring compliance with security best practices
  • Built GitHub Actions CI/CD pipelines enforcing automated testing, code quality checks, and weekly production deployments with zero-downtime rolling updates and automated rollback on test failure
  • Owned end-to-end AWS service reliability across staging and production environments: configured Application Load Balancer, health checks, auto-scaling policies, and environment parity, reducing deployment-related incidents by 35%
  • Developed cross-platform Flutter mobile app with personalized features and real-time data sync, resulting in a 40% increase in user engagement metrics
Node.js · TypeScript · AWS · PostgreSQL · OAuth2 · GitHub Actions · Flutter · REST APIs

Machine Learning Engineer Intern

Menorah AI

Jan 2022 — May 2022

  • Built production NLP inference backend integrating BERT-based intent classification models, improving recognition accuracy 25% over baseline rule-based system
  • Optimized concurrent API request handling via connection pooling and async processing, reducing response latency 30% under parallel production loads
  • Implemented model versioning and A/B testing infrastructure for comparing intent classification performance across model iterations
  • Designed monitoring dashboard tracking key inference metrics: request throughput, latency percentiles (p50/p95/p99), error rates, and model confidence scores
Python · BERT · NLP · API Optimization · DialogFlow · FastAPI · Async Processing
03

RESEARCH & EXPERIMENTS

Multi-stage Classification Pipelines

When classifying heterogeneous data, single models struggle. A cascade works better: deterministic rules catch the obvious cases at near-perfect precision, LLMs handle the ambiguous middle, and fine-tuned transformers specialize in multilingual text. The key is calibrating the confidence thresholds that decide when to escalate to the next stage.
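A minimal sketch of that escalation logic — the stage callables and confidence thresholds here are hypothetical stand-ins, not the production pipeline:

```python
def cascade_classify(entity, rules, llm, transformer,
                     llm_conf=0.9, tx_conf=0.7):
    """Three-stage cascade: cheap deterministic rules first, then an LLM
    for ambiguous cases, then a fine-tuned transformer, abstaining when
    no stage clears its confidence threshold.

    `rules`, `llm`, and `transformer` are illustrative callables."""
    label = rules(entity)
    if label is not None:          # stage 1: rules are near-perfect when they fire
        return label, "rules"
    label, conf = llm(entity)      # stage 2: LLM handles ambiguity
    if conf >= llm_conf:
        return label, "llm"
    label, conf = transformer(entity)  # stage 3: multilingual specialist
    if conf >= tx_conf:
        return label, "transformer"
    return None, "abstain"         # escalate to human review
```

Returning the stage name alongside the label makes per-stage precision easy to audit, which is what the threshold calibration feeds on.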

Async Processing for LLM APIs

LLM API rate limits are the bottleneck. Batch requests asynchronously (100-500 at a time), use exponential backoff for retries, and implement request queuing with priority levels. This improved throughput 10x over sequential calls.
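A sketch of the pattern, with a generic async callable standing in for a real LLM client — names and defaults are illustrative assumptions:

```python
import asyncio
import random

async def call_with_backoff(fn, *args, retries=5, base=0.5):
    """Retry an async call with jittered exponential backoff.

    RuntimeError stands in for a rate-limit error from a real client."""
    for attempt in range(retries):
        try:
            return await fn(*args)
        except RuntimeError:
            if attempt == retries - 1:
                raise
            await asyncio.sleep(base * 2 ** attempt * random.uniform(0.5, 1.5))

async def run_batched(fn, items, batch_size=100, concurrency=10):
    """Process items in batches with bounded concurrency."""
    sem = asyncio.Semaphore(concurrency)

    async def one(item):
        async with sem:
            return await call_with_backoff(fn, item)

    results = []
    for i in range(0, len(items), batch_size):
        batch = items[i:i + batch_size]
        results += await asyncio.gather(*(one(x) for x in batch))
    return results
```

The semaphore caps in-flight requests below the provider's rate limit, while the jitter keeps retries from synchronizing into thundering herds.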

When to Fine-tune vs Prompt

Fine-tuning: 10K+ labeled examples, consistent behavior needed, latency matters. Prompting: <1K examples, task changes frequently, flexible reasoning needed. Hybrid: use prompting for cold start, collect data, fine-tune once patterns emerge.

Building Observable Systems

Metrics-first architecture changes how you debug production systems. Instrument at boundaries: request/response cycles, external API calls, database queries. Use percentiles (p50/p95/p99) over averages. Structured logging with request IDs lets you trace execution paths across services.
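The percentiles-over-averages point can be captured in a few lines — `LatencyTracker` below uses nearest-rank percentiles and is a toy illustration, not a production metrics library:

```python
import math

class LatencyTracker:
    """Record request latencies and report percentiles — p95/p99 expose
    tail behavior that averages hide. Illustrative sketch only."""

    def __init__(self):
        self.samples = []

    def record(self, ms):
        self.samples.append(ms)

    def percentile(self, p):
        """Nearest-rank percentile: the value at rank ceil(p/100 * n)."""
        if not self.samples:
            return None
        ordered = sorted(self.samples)
        rank = max(1, math.ceil(p / 100 * len(ordered)))
        return ordered[rank - 1]
```

A single 10-second outlier barely moves the mean of a thousand 10 ms requests, but it shows up immediately in p99 — which is why the instrumented boundaries report percentiles.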

04

ABOUT

"Building systems at the intersection of distributed infrastructure and machine learning."

Graduate Research Assistant at NYU Stern, building ML classification systems that process millions of multilingual entities. Previously designed production APIs, ML inference pipelines, and distributed systems at 1INME and Menorah AI. Passionate about building scalable infrastructure that makes complex data systems reliable and performant. Experienced in end-to-end ML pipelines, distributed architectures, and production-grade software engineering.

Leadership & Recognition

Event Director · Google Developer Group NYU Tandon

President · CodingBrigade BVRIT (2021–2023)

President · SPICES (AICTE) BVRIT (2021–2022)

GyanDhan Merit-Based Scholarship · 2024

M.S. Computer Engineering

New York University

2024–2026 · GPA 3.8/4.0

B.Tech Computer Science

JNTU Hyderabad

2020–2024

LET'S
CONNECT

Puneeth Kotha · Software Engineer · NYU · 2026

New York, NY