PUNEETH KOTHA
Software Engineer
Machine Learning, Distributed Systems
SELECTED WORK
Flint: LLM Workflow Engine
Problem: Building complex LLM workflows requires code changes for every adjustment. Needed a way to convert natural-language descriptions into executable workflows.
Approach: LLM-powered compiler translating natural-language descriptions into executable DAGs. Parallel asyncio with topological scheduling, schema-based corruption detection, and exponential backoff retry scheduling.
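The parallel scheduling idea can be sketched with asyncio: each node waits on its prerequisites' completion events, so independent branches of the DAG run concurrently. The `run_dag` helper and its signature are illustrative, not Flint's actual API:

```python
import asyncio

async def run_dag(tasks, deps):
    """Run DAG tasks concurrently, starting each as soon as its deps finish.

    tasks: {name: async callable}, deps: {name: set of prerequisite names}.
    """
    done = {}
    events = {name: asyncio.Event() for name in tasks}

    async def run(name):
        for d in deps.get(name, ()):
            await events[d].wait()      # block until each prerequisite completes
        done[name] = await tasks[name]()
        events[name].set()              # unblock downstream nodes

    await asyncio.gather(*(run(n) for n in tasks))
    return done
```

Waiting on per-node events gives the same ordering guarantees as an explicit topological sort, while letting independent subtrees overlap.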
AuditAI
Problem: Manual clinical note review is time-consuming and error-prone, requiring extraction of compliance fields across multiple rule categories.
Approach: LLM-powered auditing platform with async FastAPI backend, Redis deduplication, query-optimized PostgreSQL, and React dashboard with real-time risk scoring.
Falcon: ML Inference Platform
Problem: ML models deployed as single instances become bottlenecks. Need scalable, fault-tolerant inference with observability.
Approach: Multi-worker FastAPI inference service behind Nginx with Redis caching, idempotency-key deduplication, circuit breaker, exponential backoff retries, and graceful shutdown with in-flight request draining.
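The circuit-breaker piece can be sketched as a small state machine: open after N consecutive failures, then half-open after a cooldown to let a trial request through. The class and parameter names below are illustrative, not Falcon's actual implementation:

```python
import time

class CircuitBreaker:
    """Open after `max_failures` consecutive errors; half-open after `reset_after` s."""

    def __init__(self, max_failures=5, reset_after=30.0):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at = None           # None means the circuit is closed

    def allow(self):
        if self.opened_at is None:
            return True                 # closed: pass traffic through
        if time.monotonic() - self.opened_at >= self.reset_after:
            return True                 # half-open: permit one trial request
        return False                    # open: fail fast

    def record_success(self):
        self.failures = 0
        self.opened_at = None

    def record_failure(self):
        self.failures += 1
        if self.failures >= self.max_failures:
            self.opened_at = time.monotonic()
```

Callers check `allow()` before hitting the model backend and report the outcome, so a struggling worker sheds load instead of queuing doomed requests.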
Orbis ML Classifier
Problem: Classifying 1.4M parent entities into Individual, Company, Family Firm, or Government across 120+ countries with inconsistent naming conventions and multilingual text.
Approach: Three-stage ML cascade: deterministic rules (~56% coverage), batched LLM inference via Claude 3.5 Haiku, and fine-tuned XLM-RoBERTa with confidence-threshold abstention.
LOSight
Problem: Hospitals need to predict extended-stay risk at admission for proactive discharge planning and cost management.
Approach: Modeled extended-stay risk across 1.78M NYS inpatient discharges using SPARCS 2024 data. Benchmarked five classification models under class imbalance, applying SMOTE and threshold tuning.
ViT Optimization
Problem: Vision Transformers are computationally expensive for production deployment, requiring significant latency reduction.
Approach: 4-bit NF4 quantization + FlashAttention-2 achieving 40.5% latency reduction and 449 img/sec throughput.
StockStream
Problem: Financial analytics require real-time processing with sub-second latency and fault tolerance under high-throughput loads.
Approach: Kafka and Spark streaming pipeline with hybrid PostgreSQL + InfluxDB storage optimized for time-series queries. Implemented consumer groups, checkpointing, and Grafana monitoring.
EXPERIENCE
Software Engineer
New York University
Jan 2025 — Present
- Architected three-stage ML cascade (deterministic rules, LLM batch inference, fine-tuned XLM-RoBERTa) classifying 1.4M entities across 120+ countries at 98.75% macro precision
- Fine-tuned XLM-RoBERTa on 690K multilingual samples with confidence-threshold abstention, improving classification accuracy 53% over baseline and reducing misclassification rate to under 1.25%
- Deployed async LLM inference pipeline via the Anthropic Batches API, processing 500K+ multilingual entities end-to-end and eliminating 6+ months of manual labeling
- Built interactive web interface with metrics dashboard, workflow simulation, and per-country data visualization using Python, Chart.js, and static site generation
- Engineered async batch processing system handling concurrent API requests with exponential backoff retry logic and rate-limit management
Software Engineer Intern
1INME
Aug 2023 — May 2024
- Designed and owned a Node.js/TypeScript REST backend on AWS serving real-time bidirectional sync traffic, cutting synchronization latency 60% via API refactoring and composite-index optimization across 3 core endpoints
- Implemented OAuth2 authentication flows with secure token management and session handling, ensuring compliance with security best practices
- Built GitHub Actions CI/CD pipelines enforcing automated testing, code-quality checks, and weekly production deployments with zero-downtime rolling updates and automated rollback on test failure
- Owned end-to-end AWS service reliability across staging and production environments: configured Application Load Balancer, health checks, auto-scaling policies, and environment parity, reducing deployment-related incidents by 35%
- Developed a cross-platform Flutter mobile app with personalized features and real-time data sync, resulting in a 40% increase in user engagement metrics
Machine Learning Engineer Intern
Menorah AI
Jan 2022 — May 2022
- Built a production NLP inference backend integrating BERT-based intent-classification models, improving recognition accuracy 25% over the baseline rule-based system
- Optimized concurrent API request handling via connection pooling and async processing, reducing response latency 30% under parallel production loads
- Implemented model versioning and A/B testing infrastructure for comparing intent-classification performance across model iterations
- Designed a monitoring dashboard tracking key inference metrics: request throughput, latency percentiles (p50/p95/p99), error rates, and model confidence scores
RESEARCH & EXPERIMENTS
Multi-stage Classification Pipelines
When classifying heterogeneous data, single models struggle. A cascade approach works better: rules catch obvious cases at near-perfect precision. LLMs handle ambiguity. Transformers specialize in multilingual text. The key is calibrating thresholds for when to escalate to the next stage.
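The escalation logic can be sketched as a short function. `rules`, `llm`, and `transformer` here are stand-in callables (rules return a label or None; the other two return a label plus a confidence), not the production interfaces:

```python
def classify(entity, rules, llm, transformer, threshold=0.9):
    """Cascade: cheap deterministic rules first, then LLM, then fine-tuned model.

    Escalate to the next stage only when the current one abstains or
    falls below the confidence threshold.
    """
    label = rules(entity)
    if label is not None:
        return label, "rules"           # near-perfect precision on obvious cases
    label, conf = llm(entity)
    if conf >= threshold:
        return label, "llm"             # ambiguous cases the LLM resolves confidently
    label, conf = transformer(entity)
    if conf >= threshold:
        return label, "transformer"     # multilingual specialist
    return None, "abstain"              # route to human review
```

The single `threshold` parameter is the calibration knob the note refers to; in practice each stage can carry its own threshold, tuned on held-out data.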
Async Processing for LLM APIs
LLM API rate limits are the bottleneck. Batch requests asynchronously (100-500 at a time), use exponential backoff for retries, and implement request queuing with priority levels. This improved throughput 10x over sequential calls.
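The batching-plus-backoff part can be sketched as below (priority queuing is omitted for brevity). `call_with_backoff` and `run_batched` are hypothetical helpers; `fn` stands in for any async API call:

```python
import asyncio
import random

async def call_with_backoff(fn, *args, retries=5, base=1.0):
    """Retry an async call with jittered exponential backoff."""
    for attempt in range(retries):
        try:
            return await fn(*args)
        except Exception:
            if attempt == retries - 1:
                raise                   # out of retries: surface the error
            # sleep base * 2^attempt, jittered to avoid thundering herds
            await asyncio.sleep(base * (2 ** attempt) * (0.5 + random.random()))

async def run_batched(items, fn, batch_size=100, concurrency=20, base=1.0):
    """Process items in fixed-size batches with bounded concurrency."""
    sem = asyncio.Semaphore(concurrency)

    async def one(item):
        async with sem:
            return await call_with_backoff(fn, item, base=base)

    results = []
    for i in range(0, len(items), batch_size):
        batch = items[i:i + batch_size]
        results.extend(await asyncio.gather(*(one(x) for x in batch)))
    return results
```

The semaphore caps in-flight requests below the provider's rate limit, while batching keeps the event loop saturated between rounds.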
When to Fine-tune vs Prompt
Fine-tuning: 10K+ labeled examples, consistent behavior needed, latency matters. Prompting: <1K examples, task changes frequently, flexible reasoning needed. Hybrid: use prompting for cold start, collect data, fine-tune once patterns emerge.
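The rule of thumb above can be encoded as a toy decision function; the thresholds are the ones from the note, and the function itself is just an illustration, not a tool:

```python
def adaptation_strategy(n_labeled, task_stable, latency_sensitive):
    """Pick fine-tune vs prompt vs hybrid per the heuristic above."""
    if n_labeled >= 10_000 and task_stable and latency_sensitive:
        return "fine-tune"              # enough data, stable task, latency matters
    if n_labeled < 1_000 or not task_stable:
        return "prompt"                 # little data or a moving target
    return "hybrid"                     # prompt for cold start, fine-tune later
```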
Building Observable Systems
Metrics-first architecture changes how you debug production systems. Instrument at boundaries: request/response cycles, external API calls, database queries. Use percentiles (p50/p95/p99) over averages. Structured logging with request IDs lets you trace execution paths across services.
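A minimal sketch of boundary instrumentation: a decorator that emits a JSON-structured log line carrying a request ID and records latency, plus a nearest-rank percentile over the recorded samples. Helper names (`traced`, `percentile`, the module-level `latencies` list) are illustrative:

```python
import json
import time
import uuid

latencies = []  # latency samples in ms; a real system would use a metrics client

def traced(fn):
    """Wrap a boundary call: structured log with a request ID, latency recorded."""
    def wrapper(*args, **kwargs):
        rid = str(uuid.uuid4())
        t0 = time.perf_counter()
        status = "error"
        try:
            result = fn(*args, **kwargs)
            status = "ok"
            return result
        finally:
            ms = (time.perf_counter() - t0) * 1000
            latencies.append(ms)
            print(json.dumps({"request_id": rid, "op": fn.__name__,
                              "status": status, "latency_ms": round(ms, 2)}))
    return wrapper

def percentile(samples, p):
    """Nearest-rank percentile (p in 0..100) over recorded samples."""
    s = sorted(samples)
    k = min(len(s) - 1, max(0, int(p / 100 * len(s) + 0.5) - 1))
    return s[k]
```

Because every log line carries the request ID, grepping one ID reconstructs the execution path across services; `percentile(latencies, 95)` surfaces tail latency that an average would hide.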
ABOUT
"Building systems at the intersection of distributed infrastructure and machine learning."
Graduate Research Assistant at NYU Stern, building ML classification systems that process millions of multilingual entities. Previously designed production APIs, ML inference pipelines, and distributed systems at 1INME and Menorah AI. Passionate about building scalable infrastructure that makes complex data systems reliable and performant. Experienced in end-to-end ML pipelines, distributed architectures, and production-grade software engineering.
Leadership & Recognition
Event Director · Google Developer Group NYU Tandon
President · CodingBrigade BVRIT (2021–2023)
President · SPICES (AICTE) BVRIT (2021–2022)
GyanDhan Merit-Based Scholarship · 2024
M.S. Computer Engineering
New York University
2024–2026 · GPA 3.8/4.0
B.Tech Computer Science
JNTU Hyderabad
2020–2024