Creating a best-in-class, multi-intent TV experience for millions of audiophiles worldwide

OVERVIEW

The TV experience has historically optimized for uninterrupted, lean-back playback, with minimal controls and limited interactivity. This made sense when TV was primarily an audio companion.

But as Spotify’s strategy increasingly emphasizes video, social listening, and in-session interaction, and as users increasingly want a TV experience that adapts with them, the existing Now Playing View (NPV) became a structural bottleneck rather than a foundation.

Because the legacy engineering architecture was built around a single playback mode, mode switches and layout changes introduced friction that easily broke immersion.

The old NPV could not support several intents at once: having the queue open alongside lyrics while a music video played was not possible from a UX perspective.

So how might the NPV become a predictable anchor and system that adapts to user intent, without sacrificing the core lean-back experience TV is built for?

OPPORTUNITY

Intent, not features, should drive and shape the system

From research and prior initiatives, a clear pattern emerged:

Users typically remain in a lean-back state, punctuated by brief, intentional moments of interaction such as checking the queue, joining a Jam, pivoting content or switching formats.

Any interaction that triggers a layout or mode change breaks immersion, regardless of whether the user’s intent is momentary or sustained.

The system provided no structural cues to distinguish passive listening from intentional engagement, causing all interactions to be handled through the same interaction model.

As a result, we could not pick up reliable user signals or surface the right feature or content at the right time.

SOLUTION

Rather than treating playback as a single, static state, the new TV NPV is designed as three distinct modes, each aligned to a different user journey:

Dive deeper… with the modular NPV:

The modular NPV supports moments where users intentionally lean into what’s currently playing.

This mode transforms the NPV to accommodate both video and deeper, more meaningful interactions by resizing the playback viewport and introducing a scrollable side panel. This allows users to interact with foreground features (such as queue, Jam, About the song, or contextual metadata) without leaving playback or compromising the video experience.

What this unlocks:

Foreground engagement without navigation away from the NPV
A clear spatial signal that the user is in an intentional, interactive state
A scalable container for future in-session features

Pivot content… with a simple scroll:

Pivot supports lightweight discovery and session extension when users want to explore something related.

This mode populates the below-the-fold scroll with contextual content such as chapters, related videos, and personalized video recommendations. It enables users to pivot, rabbit-hole, and extend sessions without needing to leave the NPV or initiate a new navigation flow.

This mirrors pivot behaviors users expect from video-first platforms like YouTube, while remaining grounded in the playback context.

What this unlocks:

Discovery without search or navigation
Seamless transitions between related content
Clear behavioral signals for exploratory intent

Lean back with… consume mode:

Consume is the default mode, optimized for passive, ambient listening and viewing.

This mode maintains the uninterrupted, cinema-like experience expected from TV, ensuring the NPV harmoniously fits into users’ environments.

Improvements to the ambient view, such as better artwork handling when no artist imagery is available, reinforce TV’s role as a lean-back surface.

What this preserves:

Immersion and simplicity
Minimal interaction cost
TV’s role as a background or communal medium

CORE FLOWS

The TV NPV is designed around clear, predictable entry points that allow users to move between modes based on intent, without requiring explicit mode switches or disrupting playback. Each transition is lightweight, intentional, and reversible.

All sessions begin in consume mode:
Playback starts in a full-bleed, ambient state optimized for passive listening and viewing. This establishes a predictable baseline that requires no interaction and supports TV’s lean-back nature.

Want to control the queue, invite others to listen with you or learn more about what you’re listening to?
Users dive deeper when they explicitly signal intent to engage with what’s currently playing.

This transition is triggered through foreground actions such as opening the queue, viewing artist or track context, or initiating social features.

The NPV responds by resizing the viewport and revealing a scrollable side panel, without leaving playback.

Pivoting is just a press away
Users can scroll below the fold from the main NPV to find related content.

This gesture signals exploratory intent, allowing users to discover related or adjacent content without leaving the playback context. Pivot does not reconfigure the NPV layout; instead, it extends the session vertically.

Selecting new content transitions the session forward, while backing out returns the user to the original playback state.

Enrich your viewing experience, in whichever mode
Enriching overlays can be activated on top of both full-screen consume mode and modular NPV, without triggering a mode transition.

These overlays respond to short, expressive intents, such as checking lyrics or enabling dark mode, and are designed to appear and disappear without altering the underlying layout.
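
The flows above can be read as a small state machine. The following is a minimal sketch under stated assumptions, not the production implementation: every name in it (Mode, Action, NpvState, reduce, and the action and overlay labels) is hypothetical and chosen for illustration. It shows three properties described in this section: sessions start in consume mode, dive-deeper and pivot transitions are reversible, and overlays toggle without changing the mode.

```typescript
// Illustrative sketch only. All names are hypothetical and do not
// come from the actual TV codebase.
type Mode = "consume" | "modular" | "pivot";
type Overlay = "lyrics" | "darkMode";

type Action =
  | { type: "openPanel" }        // e.g. queue, Jam, About the song
  | { type: "scrollBelowFold" }  // exploratory intent
  | { type: "back" }             // reverse the most recent transition
  | { type: "toggleOverlay"; overlay: Overlay };

interface NpvState {
  mode: Mode;
  returnTo: Mode[];     // modes to restore, most recent last
  overlays: Overlay[];  // active overlays, tracked independently of mode
}

// All sessions begin in consume mode.
const initialState: NpvState = { mode: "consume", returnTo: [], overlays: [] };

// Enter a mode and remember where the user came from, so the move is reversible.
function enter(state: NpvState, next: Mode): NpvState {
  if (state.mode === next) return state;
  return { ...state, mode: next, returnTo: [...state.returnTo, state.mode] };
}

function reduce(state: NpvState, action: Action): NpvState {
  switch (action.type) {
    case "openPanel":
      return enter(state, "modular");
    case "scrollBelowFold":
      return enter(state, "pivot");
    case "back": {
      if (state.returnTo.length === 0) return state;
      const previous = state.returnTo[state.returnTo.length - 1] ?? "consume";
      return { ...state, mode: previous, returnTo: state.returnTo.slice(0, -1) };
    }
    case "toggleOverlay": {
      // Overlays never touch `mode`: they sit on top of the current layout.
      const active = state.overlays.indexOf(action.overlay) !== -1;
      const overlays = active
        ? state.overlays.filter((o) => o !== action.overlay)
        : [...state.overlays, action.overlay];
      return { ...state, overlays };
    }
  }
}
```

In this sketch, opening a panel from consume mode and then pressing back returns the state to consume mode, and any overlay toggled in between stays active across the transition. What happens to the mode when new content is selected from the pivot scroll is deliberately left out, since the flow description does not specify it.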

USABILITY TESTING

To test whether the three-mode model and its core flows held up in real use, we moved beyond Figma and built a Cursor-powered coded prototype directly on top of the existing TV codebase. This allowed participants to use the experience with real content, real data, and real navigation flows, closely mirroring how the NPV would behave in a production environment.

Using a coded prototype was especially important for TV, where interaction cost, focus management, and navigation friction are difficult to evaluate in static or semi-interactive mocks.

Building on top of the existing TV experience gave us:

A fully functional prototype with minimal bespoke scaffolding
Production-level focus behavior, remote input, and navigation constraints
Exposure to real, imperfect content (e.g. artwork quality, profiles, metadata), reducing “polish bias” during testing

This enabled participants to behave more naturally, e.g. skipping instructions, relying on visual cues, and navigating through trial and error, which surfaced issues and opportunities that would not have appeared in a Figma prototype.

We carried out the usability test in an in-person UX research lab in London.

This is what the Cursor-coded prototype looked like in real life, on a real TV, using real content.

KEY USER INSIGHTS
Testing validated that navigation friction on TV can be significantly reduced through focus alignment and clearer visual hierarchy, improving flow continuity with minimal iteration.

Delightful moments

Participants consistently discovered and enjoyed the About section, naturally using it as an entry point for further exploration rather than a terminal view.

Fluid movement

Exploration below the fold felt intuitive once revealed, and users moved fluidly between the main NPV, modular NPV, and scroll, reinforcing the assumption that TV usage oscillates between intents rather than following a linear path.

Baseline features worked

Core features such as queue, lyrics, and audio/video transitions in this new system were immediately understood.

WHAT THIS UNLOCKS

ML & data quality
Beyond the fundamental UX improvement, the new NPV differentiates passive consumption from intentional actions (pivoting, enriching, diving deeper). This can give personalization systems cleaner behavioral signals, turning playback into an active input to discovery models rather than a dead end.
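
As a rough illustration of what "cleaner behavioral signals" could mean in practice, the sketch below tags interaction events with an intent label before they are counted. It is an assumption-heavy example, not a description of any real pipeline: the event names, the Intent labels, and the rule that unrecognized events count as passive are all hypothetical.

```typescript
// Illustrative sketch only. Event names and intent labels are hypothetical.
type Intent = "passive" | "diveDeeper" | "pivot" | "enrich";

interface InteractionEvent {
  name: string;
  timestampMs: number;
}

// Assumed mapping from event name to the intent it signals.
const INTENT_BY_EVENT: Record<string, Intent> = {
  queue_opened: "diveDeeper",
  jam_started: "diveDeeper",
  about_opened: "diveDeeper",
  scrolled_below_fold: "pivot",
  related_item_selected: "pivot",
  lyrics_toggled: "enrich",
};

// Assumption: anything not in the mapping is treated as passive consumption.
function classify(event: InteractionEvent): Intent {
  return INTENT_BY_EVENT[event.name] ?? "passive";
}

// Tally events per intent so passive playback and intentional actions
// are reported separately instead of as one undifferentiated stream.
function countByIntent(events: InteractionEvent[]): Record<Intent, number> {
  const counts: Record<Intent, number> = {
    passive: 0,
    diveDeeper: 0,
    pivot: 0,
    enrich: 0,
  };
  for (const event of events) {
    counts[classify(event)] += 1;
  }
  return counts;
}
```

The point of the sketch is the separation itself: once each event carries an intent label, a downstream model can weight a deliberate pivot differently from a track that simply kept playing.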

CLOSING WORDS AND REFLECTIONS

Designing for intent requires designing for absence
One of the hardest constraints on TV is that most of the time, nothing is happening. Users are not holding a remote, not looking at the screen, and not signaling intent.

This work reinforced that designing for TV is less about maximizing interaction and more about preserving the passive default while being ready for brief, intentional moments. The system had to work just as hard when users did nothing as when they did something.

Systems age better than features
The original pressure was to “add more capability” to the NPV, but it became clear that adding features without changing the underlying structure would only increase fragmentation and engineering cost.

Framing the problem as a system architecture challenge (rather than a feature roadmap) allowed the team to create something that can scale with video, social, and future formats without repeated rework.

Fidelity matters when testing interaction cost
Using a Cursor-powered prototype fundamentally changed what we learned. Real focus behavior, real content, and real navigation friction surfaced issues that would not have appeared in Figma. For TV, production-like fidelity is often required to evaluate experience quality meaningfully.
