Daniil Gavrilov
AI Researcher · Head of AI Research at T-Tech
Effective autism | ∃x : (x ∉ x) ∧ (x ∈ x) | Invest up to $100 at a time
About
I see no fundamental reason AI can't eventually match human ability across every domain, but we're not there yet, and the bottleneck isn't scale. It's understanding. I want to push AI forward through a deep grasp of what's actually happening inside these models, whether that means building better training methods, figuring out how to solve novel tasks, or making systems we can actually interpret and trust.
I run AI Research at T-Tech with a flat team of researchers and students who publish at ICLR, ICML, NeurIPS, and ACL. No credential gatekeeping: capability is what matters. I chose industry research over the traditional academic path for the freedom to build methods that didn't exist before, and that's still what drives the work.
Research
Focused on LLM alignment, mechanistic interpretability, and efficient computation for reasoning. Papers at ICLR, ICML, NeurIPS, ACL, EMNLP, and EACL.
Alignment & RL
Direct alignment, RL training signals, controllable and safe generation for language and vision-language models.
Interpretability
Sparse autoencoders, feature flow, representation matching, and mechanistic steering of language models.
Efficient Computation
Adaptive depth and pondering, learnable kernels for efficient in-context modeling.
Experience
Publications
Introduces VL-DAC, a reinforcement learning algorithm that trains vision-language models in inexpensive synthetic environments while achieving strong real-world generalization. By decoupling PPO updates for action tokens from value learning, the method improves convergence and produces policies with up to 50% relative gains on control tasks without compromising general image understanding.
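A rough sketch of the decoupling idea, under my own assumptions about names and shapes (per-token log-probabilities, a mask selecting action tokens, and a value head read once per environment step); this is not the paper's code:

```python
import torch

def decoupled_ppo_losses(logp_new, logp_old, adv, action_mask,
                         step_values, step_returns, clip_eps=0.2):
    # Clipped PPO objective, applied only at action-token positions.
    ratio = (logp_new - logp_old).exp()
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps)
    policy_loss = -(torch.min(ratio * adv, clipped * adv)
                    * action_mask).sum() / action_mask.sum()
    # Value regression is decoupled: one target per environment step,
    # not per generated token.
    value_loss = ((step_values - step_returns) ** 2).mean()
    return policy_loss, value_loss
```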
Maps sparse autoencoder features across consecutive layers of large language models using a data-free cosine similarity technique. The resulting flow graphs reveal how features persist, transform, or emerge at each stage, enabling fine-grained mechanistic interpretability and targeted steering of model behavior by amplifying or suppressing chosen features.
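A minimal sketch of the data-free matching step, assuming each SAE exposes a decoder matrix with one column per feature (variable names are mine):

```python
import torch
import torch.nn.functional as F

def match_features(dec_a, dec_b):
    # dec_a: (d_model, n_feat_a) decoder of the layer-l SAE.
    # dec_b: (d_model, n_feat_b) decoder of the layer-(l+1) SAE.
    a = F.normalize(dec_a, dim=0)
    b = F.normalize(dec_b, dim=0)
    sim = a.T @ b                  # (n_feat_a, n_feat_b) cosine matrix
    score, idx = sim.max(dim=1)    # best next-layer match per feature
    return idx, score

# Features whose best-match score stays high persist across the layer;
# low scores suggest the feature transforms or dies out.
```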
Introduces SAE Match, a data-free method for aligning sparse autoencoder features across different layers by minimizing error between folded autoencoder parameters. Tested on Gemma 2, the approach effectively tracks feature dynamics and approximates hidden states across layers, advancing mechanistic interpretability of neural networks.
Proposes Trust Region alignment methods that dynamically adjust the reference policy during offline LLM training to prevent overoptimization. The TR-DPO, TR-IPO, and TR-KTO variants demonstrate improved performance across dialogue, summarization, and general-purpose benchmarks including AlpacaEval 2 and Arena-Hard.
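The two reference-policy update rules are simple to sketch (a paraphrase of the idea, not the released code):

```python
import torch

@torch.no_grad()
def soft_update(policy, ref, alpha=0.01):
    # TR soft update: ref <- (1 - alpha) * ref + alpha * policy.
    # Pulling the KL anchor toward the current policy keeps it from
    # becoming stale, which is what curbs reward overoptimization.
    for p_ref, p in zip(ref.parameters(), policy.parameters()):
        p_ref.mul_(1.0 - alpha).add_(p, alpha=alpha)

@torch.no_grad()
def hard_update(policy, ref):
    # TR hard update: copy the policy into the reference every tau steps.
    ref.load_state_dict(policy.state_dict())
```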
Presents HierarchicalTopK, a training approach enabling a single sparse autoencoder to optimize across multiple sparsity levels simultaneously, eliminating the need for separate models per budget. Tested on Gemma-2 2B, the method achieves superior sparsity-vs-explained-variance tradeoffs while preserving high interpretability scores even at higher sparsity.
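A sketch of the nested objective under my own naming (enc/dec are the two halves of the SAE; the paper's exact parameterization may differ):

```python
import torch

def hierarchical_topk_loss(x, enc, dec, ks=(8, 16, 32, 64)):
    z = torch.relu(enc(x))                      # (batch, n_latents)
    order = z.argsort(dim=-1, descending=True)  # latents by activation size
    loss = 0.0
    for k in ks:                                # score every sparsity budget
        mask = torch.zeros_like(z)
        mask.scatter_(-1, order[..., :k], 1.0)  # keep only the top-k latents
        loss = loss + ((dec(z * mask) - x) ** 2).mean()
    return loss / len(ks)
```

Because the budgets are nested, one trained model can later be run at whichever sparsity level a deployment allows.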
Shows that training a single d-dimensional steering vector per layer with RL, while freezing all base weights, matches fully RL-tuned reasoning models on mathematical tasks. Adding only ~0.0016% additional parameters to an 8B model, the approach drastically reduces optimizer memory and inter-GPU communication, while logit-lens analysis reveals that the learned vectors amplify coherent token directions.
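The mechanism is easy to picture: freeze the network, add one trainable vector to each layer's residual stream, and let RL touch only those vectors. A hedged, hook-based sketch (the model.layers and hidden_size attributes are hypothetical):

```python
import torch

class SteeringVector(torch.nn.Module):
    def __init__(self, d_model):
        super().__init__()
        self.v = torch.nn.Parameter(torch.zeros(d_model))

    def hook(self, module, inputs, output):
        # Shift every position's residual stream by the learned vector.
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden + self.v
        return (hidden, *output[1:]) if isinstance(output, tuple) else hidden

# Usage sketch: freeze the base model, register one vector per layer.
# for p in model.parameters(): p.requires_grad_(False)
# vectors = [SteeringVector(model.config.hidden_size) for _ in model.layers]
# for layer, vec in zip(model.layers, vectors):
#     layer.register_forward_hook(vec.hook)
```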
Proposes a residual learning method where a secondary sparse autoencoder is trained to model the reconstruction error of an existing SAE on specialized texts. By combining both models' outputs during inference, the approach captures domain-specific features the primary model missed while preserving general task performance.
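The training loop reduces to fitting the second SAE on what the first one leaves behind; a minimal sketch with illustrative names:

```python
import torch

def residual_sae_step(x, base_sae, res_sae, optimizer):
    with torch.no_grad():
        err = x - base_sae(x)      # what the frozen base SAE fails to explain
    loss = ((res_sae(err) - err) ** 2).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Inference combines both: x_hat = base_sae(x) + res_sae(x - base_sae(x))
```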
Introduces KronSAE, a novel architecture that factorizes the latent representation via Kronecker product decomposition, drastically reducing memory and computational overhead for training sparse autoencoders at scale. Also proposes mAND, a differentiable activation function approximating binary AND that improves interpretability in the factorized framework.
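The core saving can be sketched as composing a large latent code from two small ones (my reading of the factorization; mAND's exact form is in the paper and not reproduced here):

```python
import torch

def kron_latents(za, zb):
    # za: (batch, n), zb: (batch, m) -> (batch, n * m).
    # An (n * m)-entry dictionary costs only n + m encoder outputs.
    return (za.unsqueeze(-1) * zb.unsqueeze(-2)).flatten(-2)
```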
Examines how direct alignment algorithms differ across SFT stages, scalar scores, and ranking objectives. Reveals that one-stage methods like ORPO improve substantially in two-stage setups, and that the choice between pairwise and pointwise objectives matters more than the specific scoring function, with implications for avoiding overstated superiority claims in alignment research.
Presents a modification to the Based linear transformer kernel that amplifies in-context learning abilities. While subquadratic architectures such as state space models have shown deficiencies in in-context learning, this work demonstrates that a single alteration to the Taylor-expansion-inspired kernel improves both multi-query associative recall and overall language modeling on the Pile dataset.
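In spirit, the altered kernel swaps the exponential's Taylor features for a learnable quadratic map; a hedged sketch, not the paper's exact parameterization:

```python
import torch

class QuadKernel(torch.nn.Module):
    # Normalize, apply a per-dimension affine, then square: the output is
    # nonnegative, so it is a valid feature map for linear attention,
    # where attn(q, k, v) ~ phi(q) @ (phi(k).T @ v).
    def __init__(self, d):
        super().__init__()
        self.norm = torch.nn.LayerNorm(d)
        self.gamma = torch.nn.Parameter(torch.ones(d))
        self.beta = torch.nn.Parameter(torch.zeros(d))

    def forward(self, x):
        return (self.norm(x) * self.gamma + self.beta) ** 2
```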
Introduces a deterministic Q-exit criterion and revised architecture to improve adaptive computation time in pre-trained models. Applied to ALBERT and RoBERTa, the approach addresses the variance issues of PonderNet's stochastic exit sampling, achieving significant improvements over the original architecture and surpassing PABEE across multiple GLUE tasks.
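One plausible rendering of a deterministic exit rule in this spirit: leave at the first layer where the cumulative halting probability clears a threshold, instead of sampling the exit layer as PonderNet does. Treat the details as my assumption:

```python
def deterministic_exit(lambdas, threshold=0.5):
    # lambdas[i]: probability of halting at layer i, given layer i is reached.
    p_continue = 1.0
    for i, lam in enumerate(lambdas):
        p_continue *= (1.0 - lam)
        if 1.0 - p_continue >= threshold:  # cumulative halting probability
            return i
    return len(lambdas) - 1
```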
Introduces CAIF sampling, a technique for directing text generation by using classifiers to modify language model logits at inference time. Tested on toxicity reduction and sentiment control, the method outperforms PPLM, GeDi, and DExperts baselines while offering greater simplicity and fewer implementation constraints.
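The inference-time mixing step is compact; a sketch assuming classifier log-probabilities for each top-k candidate are already computed (names and the exact weighting are mine):

```python
import torch

def guided_logits(lm_logits, classifier_logp, alpha=5.0, top_k=100):
    # lm_logits: (vocab,); classifier_logp: (vocab,) attribute scores for
    # the sequence extended with each candidate token.
    scores, idx = lm_logits.topk(top_k)
    out = torch.full_like(lm_logits, float("-inf"))
    out[idx] = scores + alpha * classifier_logp[idx]  # rest stays masked
    return out
```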
Proposes fine-tuning language models with policy gradient reinforcement learning to directly optimize generation quality. When combined with unlikelihood training to minimize repetition, the approach further reduces dull and repetitive text without degrading language model quality, outperforming other training-time and decoding-time methods.
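Combining the two objectives can be sketched in a few lines (a generic rendering under my own shapes, not the paper's code):

```python
import torch

def pg_with_unlikelihood(logp, reward, probs, repeat_mask, beta=1.0):
    # REINFORCE-style term on sampled continuations ...
    pg = -(reward * logp).mean()
    # ... plus an unlikelihood penalty pushing down repeated tokens.
    ul = -(torch.log1p(-probs.clamp(max=0.999)) * repeat_mask).sum() \
         / repeat_mask.sum().clamp(min=1)
    return pg + beta * ul
```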
Get in Touch
Forbes 30 Under 30 (Science & Tech), 2025 · Setters Media A-List, 2025