Daniil Gavrilov
AI Researcher · Head of AI Research at T-Tech
Effective autism | ∃x : (x ∉ x) ∧ (x ∈ x) | Invest up to $100 at a time
About
I see no fundamental reason AI can't eventually match human ability across every domain, but we're not there yet, and the bottleneck isn't scale. It's understanding. I want to push AI forward through a deep grasp of what's actually happening inside these models, whether that means building better training methods, figuring out how to solve novel tasks, or making systems we can actually interpret and trust.
I run AI Research at T-Tech with a flat team of researchers and students who publish at ICLR, ICML, NeurIPS, and ACL. No credential gatekeeping: capability is what matters. I chose industry research over the traditional academic path for the freedom to build methods that didn't exist before, and that's still what drives the work.
Research
Focused on LLM alignment, mechanistic interpretability, and efficient computation for reasoning. Papers at ICLR, ICML, NeurIPS, ACL, EMNLP, and EACL.
Alignment & RL
Direct alignment, RL training signals, controllable and safe generation for language and vision-language models.
Interpretability
Sparse autoencoders, feature flow, representation matching, and mechanistic steering of language models.
Efficient Computation
Adaptive depth and pondering, learnable kernels for efficient in-context modeling.
Experience
Publications
Introduces VL-DAC, a reinforcement learning algorithm that trains vision-language models in inexpensive synthetic environments while achieving strong real-world generalization. By decoupling PPO updates for action tokens from value learning, the method improves convergence and produces policies with up to 50% relative gains on control tasks without compromising general image understanding.
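A rough sketch of the decoupling idea, under my own assumptions about names and shapes (per-token log-probabilities, a mask selecting action tokens, and a value head read once per environment step); this is not the paper's code:

```python
import torch

def decoupled_ppo_losses(logp_new, logp_old, adv, action_mask,
                         step_values, step_returns, clip_eps=0.2):
    # Clipped PPO objective, applied only at action-token positions.
    ratio = (logp_new - logp_old).exp()
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps)
    policy_loss = -(torch.min(ratio * adv, clipped * adv)
                    * action_mask).sum() / action_mask.sum()
    # Value regression is decoupled: one target per environment step,
    # not per generated token.
    value_loss = ((step_values - step_returns) ** 2).mean()
    return policy_loss, value_loss
```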
Maps sparse autoencoder features across consecutive layers of large language models using a data-free cosine similarity technique. The resulting flow graphs reveal how features persist, transform, or emerge at each stage, enabling fine-grained mechanistic interpretability and targeted steering of model behavior by amplifying or suppressing chosen features.
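A minimal sketch of the data-free matching step, assuming each SAE exposes a decoder matrix with one column per feature (variable names are mine):

```python
import torch
import torch.nn.functional as F

def match_features(dec_a, dec_b):
    # dec_a: (d_model, n_feat_a) decoder of the layer-l SAE.
    # dec_b: (d_model, n_feat_b) decoder of the layer-(l+1) SAE.
    a = F.normalize(dec_a, dim=0)
    b = F.normalize(dec_b, dim=0)
    sim = a.T @ b                  # (n_feat_a, n_feat_b) cosine matrix
    score, idx = sim.max(dim=1)    # best next-layer match per feature
    return idx, score

# Features whose best-match score stays high persist across the layer;
# low scores suggest the feature transforms or dies out.
```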
Introduces SAE Match, a data-free method for aligning sparse autoencoder features across different layers by minimizing error between folded autoencoder parameters. Tested on Gemma 2, the approach effectively tracks feature dynamics and approximates hidden states across layers, advancing mechanistic interpretability of neural networks.
Proposes Trust Region alignment methods that dynamically adjust the reference policy during offline LLM training to prevent overoptimization. The TR-DPO, TR-IPO, and TR-KTO variants demonstrate improved performance across dialogue, summarization, and general-purpose benchmarks including AlpacaEval 2 and Arena-Hard.
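The two reference-policy update rules are simple to sketch (a paraphrase of the idea, not the released code):

```python
import torch

@torch.no_grad()
def soft_update(policy, ref, alpha=0.01):
    # TR soft update: ref <- (1 - alpha) * ref + alpha * policy.
    # Pulling the KL anchor toward the current policy keeps it from
    # becoming stale, which is what curbs reward overoptimization.
    for p_ref, p in zip(ref.parameters(), policy.parameters()):
        p_ref.mul_(1.0 - alpha).add_(p, alpha=alpha)

@torch.no_grad()
def hard_update(policy, ref):
    # TR hard update: copy the policy into the reference every tau steps.
    ref.load_state_dict(policy.state_dict())
```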
Presents HierarchicalTopK, a training approach enabling a single sparse autoencoder to optimize across multiple sparsity levels simultaneously, eliminating the need for separate models per budget. Tested on Gemma-2 2B, the method achieves superior sparsity-vs-explained-variance tradeoffs while preserving high interpretability scores even at higher sparsity.
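A sketch of the nested objective under my own naming (enc/dec are the two halves of the SAE; the paper's exact parameterization may differ):

```python
import torch

def hierarchical_topk_loss(x, enc, dec, ks=(8, 16, 32, 64)):
    z = torch.relu(enc(x))                      # (batch, n_latents)
    order = z.argsort(dim=-1, descending=True)  # latents by activation size
    loss = 0.0
    for k in ks:                                # score every sparsity budget
        mask = torch.zeros_like(z)
        mask.scatter_(-1, order[..., :k], 1.0)  # keep only the top-k latents
        loss = loss + ((dec(z * mask) - x) ** 2).mean()
    return loss / len(ks)
```

Because the budgets are nested, one trained model can later be run at whichever sparsity level a deployment allows.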
Shows that training a single d-dimensional steering vector per layer with RL, while freezing all base weights, matches fully RL-tuned reasoning models on mathematical tasks. Adding only ~0.0016% additional parameters to an 8B model, the approach drastically reduces optimizer memory and inter-GPU communication, while logit-lens analysis reveals that the learned vectors amplify coherent token directions.
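The mechanism is easy to picture: freeze the network, add one trainable vector to each layer's residual stream, and let RL touch only those vectors. A hedged, hook-based sketch (the model.layers and hidden_size attributes are hypothetical):

```python
import torch

class SteeringVector(torch.nn.Module):
    def __init__(self, d_model):
        super().__init__()
        self.v = torch.nn.Parameter(torch.zeros(d_model))

    def hook(self, module, inputs, output):
        # Shift every position's residual stream by the learned vector.
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden + self.v
        return (hidden, *output[1:]) if isinstance(output, tuple) else hidden

# Usage sketch: freeze the base model, register one vector per layer.
# for p in model.parameters(): p.requires_grad_(False)
# vectors = [SteeringVector(model.config.hidden_size) for _ in model.layers]
# for layer, vec in zip(model.layers, vectors):
#     layer.register_forward_hook(vec.hook)
```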
Proposes a residual learning method where a secondary sparse autoencoder is trained to model the reconstruction error of an existing SAE on specialized texts. By combining both models' outputs during inference, the approach captures domain-specific features the primary model missed while preserving general task performance.
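The training loop reduces to fitting the second SAE on what the first one leaves behind; a minimal sketch with illustrative names:

```python
import torch

def residual_sae_step(x, base_sae, res_sae, optimizer):
    with torch.no_grad():
        err = x - base_sae(x)      # what the frozen base SAE fails to explain
    loss = ((res_sae(err) - err) ** 2).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Inference combines both: x_hat = base_sae(x) + res_sae(x - base_sae(x))
```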
Introduces KronSAE, a novel architecture that factorizes the latent representation via Kronecker product decomposition, drastically reducing memory and computational overhead for training sparse autoencoders at scale. Also proposes mAND, a differentiable activation function approximating binary AND that improves interpretability in the factorized framework.
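The core saving can be sketched as composing a large latent code from two small ones (my reading of the factorization; mAND's exact form is in the paper and not reproduced here):

```python
import torch

def kron_latents(za, zb):
    # za: (batch, n), zb: (batch, m) -> (batch, n * m).
    # An (n * m)-entry dictionary costs only n + m encoder outputs.
    return (za.unsqueeze(-1) * zb.unsqueeze(-2)).flatten(-2)
```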
Examines how direct alignment algorithms differ across SFT stages, scalar scores, and ranking objectives. Reveals that one-stage methods like ORPO improve substantially in two-stage setups, and that the choice between pairwise and pointwise objectives matters more than the specific scoring function, with implications for avoiding overstated superiority claims in alignment research.
Presents a modification to the Based linear transformer kernel that amplifies in-context learning abilities. While subquadratic architectures such as state space models have shown deficiencies in in-context learning, this work demonstrates that a single alteration to the Taylor-expansion-inspired kernel improves both multi-query associative recall and overall language modeling on the Pile dataset.
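In spirit, the altered kernel swaps the exponential's Taylor features for a learnable quadratic map; a hedged sketch, not the paper's exact parameterization:

```python
import torch

class QuadKernel(torch.nn.Module):
    # Normalize, apply a per-dimension affine, then square: the output is
    # nonnegative, so it is a valid feature map for linear attention,
    # where attn(q, k, v) ~ phi(q) @ (phi(k).T @ v).
    def __init__(self, d):
        super().__init__()
        self.norm = torch.nn.LayerNorm(d)
        self.gamma = torch.nn.Parameter(torch.ones(d))
        self.beta = torch.nn.Parameter(torch.zeros(d))

    def forward(self, x):
        return (self.norm(x) * self.gamma + self.beta) ** 2
```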
Introduces a deterministic Q-exit criterion and revised architecture to improve adaptive computation time in pre-trained models. Applied to ALBERT and RoBERTa, the approach addresses the variance issues of PonderNet's stochastic exit sampling, achieving significant improvements over the original architecture and surpassing PABEE across multiple GLUE tasks.
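One plausible rendering of a deterministic exit rule in this spirit: leave at the first layer where the cumulative halting probability clears a threshold, instead of sampling the exit layer as PonderNet does. Treat the details as my assumption:

```python
def deterministic_exit(lambdas, threshold=0.5):
    # lambdas[i]: probability of halting at layer i, given layer i is reached.
    p_continue = 1.0
    for i, lam in enumerate(lambdas):
        p_continue *= (1.0 - lam)
        if 1.0 - p_continue >= threshold:  # cumulative halting probability
            return i
    return len(lambdas) - 1
```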
Introduces CAIF sampling, a technique for directing text generation by using classifiers to modify language model logits at inference time. Tested on toxicity reduction and sentiment control, the method outperforms PPLM, GeDi, and DExperts baselines while offering greater simplicity and fewer implementation constraints.
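The inference-time mixing step is compact; a sketch assuming classifier log-probabilities for each top-k candidate are already computed (names and the exact weighting are mine):

```python
import torch

def guided_logits(lm_logits, classifier_logp, alpha=5.0, top_k=100):
    # lm_logits: (vocab,); classifier_logp: (vocab,) attribute scores for
    # the sequence extended with each candidate token.
    scores, idx = lm_logits.topk(top_k)
    out = torch.full_like(lm_logits, float("-inf"))
    out[idx] = scores + alpha * classifier_logp[idx]  # rest stays masked
    return out
```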
Proposes fine-tuning language models with policy gradient reinforcement learning to directly optimize generation quality. When combined with unlikelihood training to minimize repetition, the approach further reduces dull and repetitive text without degrading language model quality, outperforming other training-time and decoding-time methods.
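Combining the two objectives can be sketched in a few lines (a generic rendering under my own shapes, not the paper's code):

```python
import torch

def pg_with_unlikelihood(logp, reward, probs, repeat_mask, beta=1.0):
    # REINFORCE-style term on sampled continuations ...
    pg = -(reward * logp).mean()
    # ... plus an unlikelihood penalty pushing down repeated tokens.
    ul = -(torch.log1p(-probs.clamp(max=0.999)) * repeat_mask).sum() \
         / repeat_mask.sum().clamp(min=1)
    return pg + beta * ul
```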
Get in Touch
Forbes 30 Under 30 (Science & Tech), 2025 · Setters Media A-List, 2025