
Netflix App Testing At Scale. Find out how Netflix handled the… | by Jose Alcérreca | Android Developers | Apr, 2025


This is part of the Testing at scale series of articles where we asked industry experts to share their testing strategies. In this article, Ken Yee, Senior Engineer at Netflix, tells us about the challenges of testing a playback app at a huge scale and how they've evolved the testing strategy since the app was created 14 years ago!

Testing at Netflix constantly evolves. In order to fully understand where it's going and why it's in its current state, it's also important to understand the historical context of where it has been.

The Android app was started 14 years ago. It was initially a hybrid application (native + webview), but it was converted over to a fully native app because of performance issues and the difficulty of creating a UI that felt/acted truly native. As with most older applications, it's in the process of being converted to Jetpack Compose. The current codebase is roughly 1M lines of Java/Kotlin code spread across 400+ modules and, like most older apps, there is also a monolith module because the original app was one big module. The app is handled by a team of roughly 50 people.

At one point, there was a dedicated mobile SDET (Software Development Engineer in Test) team that handled writing all device tests, following the typical flow of working with developers and product managers to understand the features they were testing in order to create test plans for all of their automation tests. At Netflix, SDETs were developers with a focus on testing; they wrote automation tests with Espresso or UIAutomator; they also built frameworks for testing and integrated third-party testing frameworks. Feature developers wrote unit tests and Robolectric tests for their own code. The dedicated SDET team was disbanded a few years ago and the automation tests are now owned by each of the feature subteams; there are still 2 supporting SDETs who help out the various teams as needed. QA (Quality Assurance) manually tests releases before they're uploaded, as a final "smoke test".

In the media streaming world, one interesting challenge is the huge ecosystem of playback devices using the app. We like to support a good experience on low-memory/slow devices (e.g. Android Go devices) while providing a premium experience on higher-end devices. For foldables, some devices don't report a hinge sensor. We support devices back to Android 7.0 (API 24), but we're setting our minimum to Android 9 soon. Some manufacturer-specific versions of Android also have quirks. As a result, physical devices are a huge part of our testing.

As mentioned, feature developers now handle all aspects of testing their features. Our testing layers look like this:

Test pyramid showing layers from bottom to top: unit tests, screenshot tests, E2E automation tests, smoke tests

However, because of our heavy usage of physical device testing and the legacy parts of the codebase, our testing pyramid looks more like an hourglass or an inverted pyramid depending on which part of the code you're in. New features do have this more typical testing pyramid shape.

Our screenshot testing is also done at multiple levels: UI component, UI screen layout, and device integration screen layout. The first two are really unit tests because they don't make any network calls. The last is a replacement for most manual QA testing.

Unit tests are used to test business logic that isn't dependent on any specific device/UI behavior. In older parts of the app, we use RxJava for asynchronous code and the streams are tested. Newer parts of the app use Kotlin Flows and Composables for state flows, which are much easier to reason about and test compared to RxJava.

The frameworks we use for unit testing are:

  • Strikt: for assertions, because it has a fluent API like AssertJ but is written for Kotlin
  • Turbine: for the missing pieces in testing Kotlin Flows (see the sketch after this list)
  • Mockito: for mocking any complex classes not relevant to the current unit of code being tested
  • Hilt: for substituting test dependencies in our dependency injection graph
  • Robolectric: for testing business logic that has to interact in some way with Android services/classes (e.g., Parcelables or Services)
  • A/B test/feature flag framework: allows overriding an automation test for a specific A/B test or enabling/disabling a specific feature
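
As an illustration of how a couple of these fit together, here is a minimal sketch (not Netflix code) of a JVM unit test that uses Turbine to collect a Kotlin Flow and Strikt for assertions; the `TitleRepository` class and its `titles` flow are hypothetical names used only for this example.

```kotlin
import app.cash.turbine.test
import kotlinx.coroutines.flow.MutableStateFlow
import kotlinx.coroutines.test.runTest
import org.junit.Test
import strikt.api.expectThat
import strikt.assertions.containsExactly
import strikt.assertions.isEmpty

// Hypothetical class under test: exposes a flow of title names.
class TitleRepository {
    private val state = MutableStateFlow<List<String>>(emptyList())
    val titles = state
    fun load(newTitles: List<String>) { state.value = newTitles }
}

class TitleRepositoryTest {

    @Test
    fun `emits loaded titles`() = runTest {
        val repository = TitleRepository()

        repository.titles.test {                 // Turbine collects the flow
            expectThat(awaitItem()).isEmpty()    // initial state

            repository.load(listOf("Stranger Things", "The Crown"))

            // Strikt provides the fluent, Kotlin-friendly assertion API
            expectThat(awaitItem()).containsExactly("Stranger Things", "The Crown")
        }
    }
}
```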

Developers are encouraged to use plain unit tests before switching to Hilt or Robolectric because execution time goes up roughly 10x with each step when going from plain unit tests -> Hilt -> Robolectric. Mockito also slows down builds when using inline mocks, so inline mocks are discouraged. Device tests are several orders of magnitude slower than any of these types of unit tests. Speed of testing is important in large codebases.

Because unit tests are blocking in our CI pipeline, minimizing flakiness is extremely important. There are generally two causes of flakiness: leaving some state behind for the next test, and testing asynchronous code.

JVM (Java Virtual Machine) unit test classes are created once and then the test methods in each class are called sequentially; instrumented tests, in comparison, are run from scratch and the only time you can save is APK installation. Because of this, if a test method leaves some modified global state behind in dependent classes, the next test method can fail. Global state can take many forms, including files on disk, databases on disk, and shared classes. Using dependency injection or recreating anything that's modified solves this issue.
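
A minimal sketch of that failure mode and its fix (the `PlaybackCache` singleton is hypothetical): the shared state is recreated before every test method so whatever a previous test left behind can't leak into the next one.

```kotlin
import org.junit.Before
import org.junit.Test
import strikt.api.expectThat
import strikt.assertions.isEqualTo
import strikt.assertions.isNull

// Hypothetical global state: a process-wide cache object shared between classes.
object PlaybackCache {
    var lastTitle: String? = null
}

class PlaybackCacheTest {

    // Reset the global state before every test method; dependency injection of a
    // freshly created cache per test would achieve the same isolation.
    @Before
    fun resetGlobalState() {
        PlaybackCache.lastTitle = null
    }

    @Test
    fun `records the last played title`() {
        PlaybackCache.lastTitle = "Wednesday"
        expectThat(PlaybackCache.lastTitle).isEqualTo("Wednesday")
    }

    @Test
    fun `starts empty`() {
        // Without the @Before reset, this could fail whenever the other test runs first.
        expectThat(PlaybackCache.lastTitle).isNull()
    }
}
```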

With asynchronous code, flakiness can always happen as multiple threads change different things. Test dispatchers (Kotlin coroutines) or test schedulers (RxJava) can be used to control time on each thread to make things deterministic when testing a specific race condition. This can make the code less realistic and potentially miss some test scenarios, but it will prevent flakiness in the tests.
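
A rough sketch of the coroutines side of this, assuming hypothetical retry/backoff logic: `runTest` with a `StandardTestDispatcher` replaces real delays with a virtual clock that the test advances deterministically.

```kotlin
import kotlinx.coroutines.ExperimentalCoroutinesApi
import kotlinx.coroutines.delay
import kotlinx.coroutines.launch
import kotlinx.coroutines.test.StandardTestDispatcher
import kotlinx.coroutines.test.advanceTimeBy
import kotlinx.coroutines.test.runCurrent
import kotlinx.coroutines.test.runTest
import org.junit.Test
import strikt.api.expectThat
import strikt.assertions.isFalse
import strikt.assertions.isTrue

@OptIn(ExperimentalCoroutinesApi::class)
class RetryTimerTest {

    @Test
    fun `retry fires only after the backoff delay`() = runTest(StandardTestDispatcher()) {
        var retried = false

        // Hypothetical retry logic: wait 5 seconds of *virtual* time, then retry.
        launch {
            delay(5_000)
            retried = true
        }

        advanceTimeBy(4_999)   // advance the virtual clock to just before the deadline
        runCurrent()
        expectThat(retried).isFalse()

        advanceTimeBy(1)       // cross the deadline; no real-time sleeping involved
        runCurrent()
        expectThat(retried).isTrue()
    }
}
```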

Screenshot testing frameworks are important because they test what's visible vs. testing behavior. As a result, they're the best replacement for manual QA testing of any screens that are static (animations are still difficult to test with most screenshot testing frameworks unless the framework can control time).

We use a variety of frameworks for screenshot testing:

  • Paparazzi: for Compose UI components and screen layouts; network calls can't be made to download images, so you have to use static image resources or an image loader that draws a pattern for the requested images (we do both); a minimal sketch follows this list
  • Localization screenshot testing: captures screenshots of screens in the running app in all locales, for our UX teams to verify manually
  • Device screenshot testing: device testing used to test visual behavior of the running app
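
Here is a minimal Paparazzi sketch for the component level; the `RatingBadge` composable is a hypothetical component used only for illustration, and the test renders on the JVM with no device or network involved.

```kotlin
import androidx.compose.material.Text
import androidx.compose.runtime.Composable
import app.cash.paparazzi.DeviceConfig
import app.cash.paparazzi.Paparazzi
import org.junit.Rule
import org.junit.Test

// Hypothetical Compose UI component under test.
@Composable
fun RatingBadge(rating: String) {
    Text(text = "Rated $rating")
}

class RatingBadgeScreenshotTest {

    @get:Rule
    val paparazzi = Paparazzi(
        deviceConfig = DeviceConfig.PIXEL_5,   // render against a fixed device profile
    )

    @Test
    fun ratingBadge_default() {
        // Renders the composable on the JVM and compares it to the recorded golden image.
        paparazzi.snapshot {
            RatingBadge(rating = "TV-MA")
        }
    }
}
```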

Espresso accessibility testing: this is also a form of screenshot testing where the sizes/colors of various elements are checked for accessibility; this has also been somewhat of a pain point for us because our UX team has adopted the WCAG 44dp standard for minimum touch target size instead of Android's 48dp.
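
Espresso's accessibility checks are typically switched on once for a test class, after which every view action is validated; a minimal sketch is below, assuming a hypothetical screen with a "Play" button.

```kotlin
import androidx.test.espresso.Espresso.onView
import androidx.test.espresso.accessibility.AccessibilityChecks
import androidx.test.espresso.action.ViewActions.click
import androidx.test.espresso.matcher.ViewMatchers.withText
import org.junit.BeforeClass
import org.junit.Test

class PlayerAccessibilityTest {

    companion object {
        @JvmStatic
        @BeforeClass
        fun enableAccessibilityChecks() {
            // From here on, every perform() also runs accessibility checks
            // (touch target size, contrast, content descriptions, ...).
            AccessibilityChecks.enable().setRunChecksFromRootView(true)
        }
    }

    @Test
    fun playButton_passesAccessibilityChecks() {
        // "Play" is a hypothetical button label; the click itself is what
        // triggers the accessibility validation enabled above.
        onView(withText("Play")).perform(click())
    }
}
```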

Finally, we have device tests. As mentioned, these are orders of magnitude slower than tests that can run on the JVM. They're a replacement for manual QA and are used to smoke test the overall functionality of the app.

However, since running a fully working app in a test has external dependencies (backend, network infra, lab infra), the device tests will always be flaky in some way. This can't be emphasized enough: despite having retries, device automation tests will always be flaky over an extended period of time. Further below, we'll cover what we do to handle some of this flakiness.

We use these frameworks for device testing:

  • Espresso: the majority of device tests use Espresso, which is Android's main instrumentation testing framework for user interfaces
  • PageObject test framework: screens are written as PageObjects that tests can control, to ease the migration from XML layouts to Compose (see below for more details)
  • UIAutomator: a small "smoke test" set of tests uses UIAutomator to test the fully obfuscated binary that will get uploaded to the app store (a.k.a. Release Candidate tests); a minimal sketch follows this list
  • Performance testing framework: measures load times of various screens to check for any regressions
  • Network capture/playback framework: allows playback of recorded API calls to reduce instability of device tests
  • Backend mocking framework: tests can ask the backend to return specific results; for example, our home page has content that's entirely driven by recommendation algorithms, so a test can't deterministically look for specific titles unless the test asks the backend to return specific videos in specific states (e.g. "leaving soon") and specific rows filled with specific titles (e.g. a Coming Soon row with specific movies)
  • A/B test/feature flag framework: allows overriding an automation test for a specific A/B test or enabling/disabling a specific feature
  • Analytics testing framework: used to verify a sequence of analytics events from a set of screen actions; analytics are the most prone to breakage when screens are changed, so this is an important thing to test.
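
A minimal UIAutomator sketch of the kind of release-candidate smoke test described above; the package name and the "Play" label are hypothetical and stand in for whatever the real build exposes.

```kotlin
import android.content.Intent
import androidx.test.platform.app.InstrumentationRegistry
import androidx.test.uiautomator.By
import androidx.test.uiautomator.UiDevice
import androidx.test.uiautomator.Until
import org.junit.Assert.assertTrue
import org.junit.Before
import org.junit.Test

class LaunchSmokeTest {

    private lateinit var device: UiDevice

    // Hypothetical application id used only for illustration.
    private val targetPackage = "com.example.streaming"

    @Before
    fun launchAppFromHomeScreen() {
        device = UiDevice.getInstance(InstrumentationRegistry.getInstrumentation())
        device.pressHome()

        // Launch the installed (obfuscated) build via its launcher intent.
        val context = InstrumentationRegistry.getInstrumentation().context
        val intent = context.packageManager.getLaunchIntentForPackage(targetPackage)
            ?.apply { addFlags(Intent.FLAG_ACTIVITY_NEW_TASK or Intent.FLAG_ACTIVITY_CLEAR_TASK) }
        context.startActivity(intent)

        // Wait for the app's first window to appear.
        assertTrue(device.wait(Until.hasObject(By.pkg(targetPackage).depth(0)), 10_000))
    }

    @Test
    fun homeScreen_showsPlayableContent() {
        // UIAutomator matches what is actually rendered on screen, so it still works
        // against a fully obfuscated binary where internal identifiers are renamed.
        assertTrue(device.wait(Until.hasObject(By.textContains("Play")), 10_000))
    }
}
```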

The PageObject design pattern started as a web pattern, but it has been applied to mobile testing. It separates test code (e.g. click on the Play button) from screen-specific code (e.g. the mechanics of clicking on a button using Espresso). Because of this, it lets you abstract the test from the implementation (think interfaces vs. implementation when writing code). You can easily swap the implementation as needed when migrating from XML layouts to Jetpack Compose layouts, but the test itself (e.g. testing login) stays the same.
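
A minimal sketch of the pattern under those assumptions (hypothetical login screen, hypothetical hints/labels, Espresso-backed implementation); the test only depends on the `LoginPage` interface, so a Compose-backed implementation could later be swapped in without touching the test.

```kotlin
import androidx.test.espresso.Espresso.onView
import androidx.test.espresso.action.ViewActions.click
import androidx.test.espresso.action.ViewActions.replaceText
import androidx.test.espresso.matcher.ViewMatchers.withHint
import androidx.test.espresso.matcher.ViewMatchers.withText
import org.junit.Test

// The abstraction the test depends on: what can be done on the login screen.
interface LoginPage {
    fun enterEmail(email: String): LoginPage
    fun enterPassword(password: String): LoginPage
    fun tapSignIn(): LoginPage
}

// XML-layout implementation using Espresso; the matchers are hypothetical examples.
class EspressoLoginPage : LoginPage {
    override fun enterEmail(email: String) = apply {
        onView(withHint("Email")).perform(replaceText(email))
    }
    override fun enterPassword(password: String) = apply {
        onView(withHint("Password")).perform(replaceText(password))
    }
    override fun tapSignIn() = apply {
        onView(withText("Sign In")).perform(click())
    }
}

// The test never touches Espresso directly; a ComposeLoginPage built on
// createComposeRule/onNodeWithTag could replace EspressoLoginPage unchanged.
class LoginTest {
    private val loginPage: LoginPage = EspressoLoginPage()

    @Test
    fun signIn_withValidCredentials() {
        loginPage
            .enterEmail("test@example.com")
            .enterPassword("hunter2")
            .tapSignIn()
        // ...assertions would continue on the next screen's PageObject
    }
}
```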

In addition to using PageObjects to define an abstraction over screens, we have a concept of "Test Steps". A test consists of test steps. At the end of each step, our device lab infra will automatically capture a screenshot. This gives developers a storyboard of screenshots that show the progress of the test. When a test step fails, it's also clearly indicated (e.g., "could not click on Play button") because a test step has a "summary" and an "error description" field.
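
The step framework itself is internal, but a purely hypothetical sketch of its shape might look like the helper below, just to illustrate how a summary and error description can wrap each step.

```kotlin
// Purely hypothetical step helper (the real internal API is not public):
// runs a block as one "test step" and surfaces the step's error description on failure.
fun <T> testStep(summary: String, errorDescription: String, block: () -> T): T {
    try {
        val result = block()
        // the real framework would also capture a screenshot here for the storyboard
        return result
    } catch (t: Throwable) {
        // report the human-readable step failure instead of a raw stack trace
        throw AssertionError("$errorDescription ($summary)", t)
    }
}

// Example usage inside a device test:
// testStep(
//     summary = "Tap Play",
//     errorDescription = "could not click on Play button",
// ) { playerPage.tapPlay() }
```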

Inside of a device lab cage

Netflix was probably one of the first companies to have a dedicated device testing lab; this was before third-party services like Firebase Test Lab were available. Our lab infrastructure has a lot of the capabilities you'd expect:

  • Target specific types of devices
  • Capture video while running a test
  • Capture screenshots while running a test
  • Capture all logs

Interesting device tooling features that are uniquely Netflix:

  • Cellular tower so we can test wifi vs. cellular connections; Netflix has its own physical cellular tower in the lab that the devices are configured to connect to.
  • Network conditioning so slow networks can be simulated
  • Automated disabling of system updates on devices so they can be locked to a specific OS level
  • Only uses raw adb commands to install/run tests (all of this infrastructure predates frameworks like Gradle Managed Devices or Flank)
  • Running a suite of automated tests against an A/B test
  • Test hardware/software for verifying that a device doesn't drop frames, so our partners can verify their devices support Netflix playback properly; we also have a qualification program for devices to make sure they support HDR and other codecs properly.

Should you’re interested in extra particulars, take a look at Netflix’ tech weblog.

As mentioned above, test flakiness is one of the hardest problems with inherently unstable device tests. Tooling has to be built to:

  • Minimize flakiness
  • Identify causes of flakes
  • Notify teams that own the flaky tests

Tooling that we’ve constructed to handle the flakiness:

  • Automatically identifies the PR (Pull Request) batch in which a test started to fail and notifies the PR authors that they caused a test failure
  • Tests can be marked stable/unstable/disabled instead of using @Ignore annotations; this is used to temporarily disable a subset of tests if there's a backend issue so that false positives aren't reported on PRs
  • Automation that figures out whether a test can be promoted to stable by using spare device cycles to automatically evaluate test stability
  • Automated IfTTT (If This Then That) rules for retrying tests, ignoring temporary failures, or repairing a device
  • Failure reports let us easily filter failures by device maker, OS, or which cage the device is in, e.g. this shows how often a test fails over a period of time for these environmental factors:
Test failures over time grouped by environmental factors like staging/prod backend, OS version, phone/tablet
  • Failure reports let us triage error history to identify the most common failure causes for a test, along with screenshots:
  • Tests can be manually set up to run multiple times across devices, OS versions, or device types (phone/tablet) to reproduce flaky tests

We have a typical PR (Pull Request) CI pipeline that runs unit tests (including Paparazzi and Robolectric tests), lint, ktlint, and Detekt. Running roughly 1000 device tests is part of the PR process. In a PR, a subset of smoke tests is also run against the fully obfuscated app that would be shipped to the app store (the device tests mentioned earlier run against a partially obfuscated app).

Additional device automation tests are run as part of our post-merge suite. Whenever batches of PRs are merged, there is additional coverage provided by automation tests that can't be run on PRs, because we try to keep the PR device automation suite under 30 minutes.

In addition, there are Daily and Weekly suites. These run the much longer automation tests, because we try to keep our post-merge suite under 120 minutes. Automation tests that go into these are typically long-running stress tests (e.g., can you watch a season of a series without the app running out of memory and crashing?).

In a perfect world, you would have infinite resources to do all your testing. If you had infinite devices, you could run all your device tests in parallel. If you had infinite servers, you could run all your unit tests in parallel. If you had both, you could run everything on every PR. But in the real world, you need a balanced approach that runs "enough" tests on PRs, post-merge, etc. to prevent issues from getting out into the field, so your customers have a better experience while your teams also stay productive.

Coverage on devices is a set of tradeoffs. On PRs, you want to maximize coverage but minimize time. On post-merge/Daily/Weekly, time is less important.

When testing on devices, we have a two-dimensional matrix of OS version vs. device type (phone/tablet). Layout issues are fairly common, so we always run tests on phone + tablet. We're still adding automation for foldables, but they have their own challenges, such as being able to test layouts before/after/during the folding process.

On PRs, we generally run what we call a "slim grid", which means a test can run on any OS version. On post-merge/Daily/Weekly, we run what we call a "full grid", which means a test runs on every OS version. The tradeoff is that if there is an OS-specific failure, it may look like a flaky test on a PR and won't be detected until later.

Testing continuously evolves as you learn what works or as new technologies and frameworks become available. We're currently evaluating emulators to speed up our PRs. We're also evaluating Roborazzi to reduce device-based screenshot testing; Roborazzi allows testing of interactions while Paparazzi doesn't. We're building up a modular "demo app" system that allows for feature-level testing instead of app-level testing. Improving app testing never ends…


