Notes
2025
0925 Demo Interview Higher Seniority

Awesome—let’s level this up to Senior/Staff expectations. Below are all 4 sections (≈1 hour each), each with 12 senior‑level questions and detailed sample answers (the kind of depth you’d expect from a strong candidate for NVIDIA SWQA Test Development).


Section 1 — QA Methodology, Linux & Docker (Senior)

Goals: Strategy at scale, deep Linux troubleshooting, container security/perf, and production‑grade reproducibility.


Q1. Architecting a risk‑based test strategy for a GPU driver release across multiple OSes (Ubuntu/RHEL/Yocto). What’s your plan, exit criteria, and reporting?

Sample Answer (Senior):

  • Plan:
    • Risk matrix by surface: kernel module (stability, ABI), user‑space lib (compat), toolchain, packaging, installer.
    • Test pyramid adapted to systems: unit (driver stubs), component (ioctl, memory mgmt, concurrency), integration on bare‑metal + KVM, HIL for boards, soak/stress (24–72h).
    • Coverage artifacts: RTM → requirements/features; code coverage (gcov/kcov for kernel pieces where viable), scenario coverage (RAG style workloads: training/inference, video, CUDA samples).
  • Exit criteria:
    • No P0/P1 open; P2 risk accepted with documented mitigation.
    • Soak: mean time to failure (MTTF) > target; perf regression < 3% vs baseline; installer rollback verified.
    • Security: image SBOM scanned; no High/Critical CVEs unmitigated.
  • Reporting:
    • Daily burn‑down, defect arrival/closure, Defect Removal Efficiency (DRE), pass/fail heatmaps by platform, flakiness index (<2% target).

Q2. You inherit a flaky test suite (2.8% flake rate). What governance and technical steps do you implement to get it <0.5%?

Sample Answer:

  • Governance: flake label & auto‑quarantine, flake budget per team, SLA to fix within sprint, CI gate forbids new flakes.
  • Technical: deterministic seeds; eliminate time‑based waits → condition waits; idempotent setup/teardown; isolate external deps via service virtualization; record/replay for APIs; retry‑with‑trace only during diagnosis; stability CI lane (10x reruns) to measure.
  • Observability: structured logs + trace IDs, attach artifact bundles (logs, dmesg, nvidia‑smi, kernel traces).
  • Outcome metric: weekly flake burndown, Pareto by root cause.

Q3. Severe customer crash with minimal logs. How do you capture and analyze a Linux kernel panic tied to a GPU driver?

Sample Answer:

  • Enable kdump; ensure crashkernel reserved; reproduce and collect vmcore.
  • Analyze with crash/gdb + vmlinux; inspect backtraces of GPU driver threads/interrupts; check lockdep, refcounts.
  • Collect dmesg, /sys/kernel/debug stats, GPU error registers if exposed; verify module versions via modinfo and build IDs.
  • Correlate with power/thermal via nvidia-smi --query-gpu and IPMI/BMC telemetry.
  • Create min‑repro harness; add bisect across driver commits if needed.

Q4. Deep‑dive Linux perf triage: diagnosing intermittent I/O latency spikes during model training. What tools/steps?

Sample Answer:

  • Start with iostat -x 1, vmstat 1, sar, perf top/record, bpftop/eBPF (biolatency, runqlat, offcpu), pidstat -d -u -t.
  • For block layer: blktrace/btt, check queue depth, scheduler, NVMe firmware.
  • NUMA & PCIe: lspci -vv, numactl --hardware, confirm GPU & NIC/SSD topology (avoid cross‑socket).
  • Mitigation: pinning threads, I/O scheduler tuning (mq‑deadline), increase read‑ahead for streaming, ensure hugepages/pinned memory sane, validate IRQ affinity.

Q5. Explain Docker isolation: namespaces, cgroups, capabilities, seccomp/AppArmor—and how you’d harden a test container.

Sample Answer:

  • Namespaces: pid, net, mnt, ipc, uts, user isolate views. cgroups v2 limit CPU/mem/io.
  • Capabilities: start from --cap-drop=ALL and add minimal required (e.g., CAP_NET_BIND_SERVICE).
  • seccomp: default profile to restrict syscalls; AppArmor/SELinux for MAC.
  • Rootless containers for CI; read‑only root FS; dedicated network policy if on K8s.
docker run --read-only --pids-limit=512 \
  --cap-drop=ALL --security-opt no-new-privileges \
  --memory=4g --cpus=2 --tmpfs /tmp:rw,size=256m myimage:sha256...

Q6. Build reproducible, minimal, and secure images for Python tests that run CUDA tools. How?

Sample Answer:

  • Multi‑stage builds: compile deps in builder; copy wheels/binaries.
  • Prefer distroless or minimal base; if CUDA needed, start from NVIDIA runtime‑specific images; pin digests.
  • Non‑root user, locked versions, pip --require-hashes, SBOM (e.g., syft) and scan (grype/trivy).
  • Cache: --mount=type=cache with BuildKit; .dockerignore curated.
# syntax=docker/dockerfile:1.7
FROM nvidia/cuda:12.4.1-runtime-ubuntu22.04 AS base
RUN useradd -m runner && mkdir /app && chown runner /app
USER runner
WORKDIR /app
COPY --chown=runner:runner requirements.txt .
RUN pip install --no-cache-dir --require-hashes -r requirements.txt
COPY --chown=runner:runner . .
ENTRYPOINT ["pytest","-q","-n","auto"]

Q7. Networking inside Docker is flaky under load. Outline a senior‑level troubleshooting plan.

Sample Answer:

  • Inspect docker network ls/inspect; verify bridge MTU vs host; check conntrack table saturation.
  • Capture traffic with tcpdump at host and container veth; analyze with Wireshark.
  • Validate hairpin NAT behavior, DNS resolver timeouts, and ephemeral port exhaustion; tune /proc/sys/net/netfilter params.
  • If on K8s: verify CNI MTU (Calico/Flannel), NetworkPolicy, and kube-proxy mode.

Q8. Reproducibility across customer environments (driver + CUDA + OS). How do you encode and exchange the environment?

Sample Answer:

  • Container digests (not tags), plus full SBOM; include kernel headers/toolchain versions; pin CUDA toolkit versions.
  • Include infra manifests (docker-compose or Helm chart values), data snapshots with checksums, and seed configs.
  • Capture one‑touch repro scripts; store artifact bundle per defect.

Q9. Designing an optimal test matrix for combinatorial explosion (OS × GPU SKU × CUDA × App Type).

Sample Answer:

  • Use pairwise / orthogonal arrays to minimize runs yet cover interactions.
  • Prioritize top 80% usage combos; ensure full coverage for safety‑critical paths.
  • Keep canary lanes for rare combos; revisit quarterly with telemetry.

Q10. When to test on bare‑metal vs VM vs container?

Sample Answer:

  • Bare‑metal: performance, device/driver/kernel timings, power/thermal.
  • VM: functional coverage where paravirtualization suffices; snapshot‑based repros.
  • Container: app/user‑space tests; not ideal for kernel driver validation.

Q11. Scaling soak/stress tests and avoiding resource starvation on shared labs.

Sample Answer:

  • Admission control with resource quotas; schedule via priority classes; enforce cgroups limits.
  • Staggered starts, chaos injection windows; telemetry budget per node to avoid overhead skewing results.
  • Automatic quarantine of noisy neighbors.

Q12. Define a defect triage policy (severity vs priority), and your approach to escape analysis.

Sample Answer:

  • Severity = technical impact; Priority = business urgency. Have a Bug Bar; P0 = data loss/crash/regression.
  • Escape analysis: categorize by phase missed (spec, code review, test); implement CAPA, add regression test and checklist updates; track escaped defect rate trend.

Section 2 — Gen AI & AI Tools (Senior)

Goals: Evaluation architecture, safety, serving performance on GPUs, governance, and LLMOps in CI/CD.


Q1. You must evaluate a RAG‑powered feature. Design the evaluation plan and metrics.

Sample Answer:

  • Datasets: stratified queries by intent; gold references + citations.
  • Metrics:
    • Retrieval: Recall@k / nDCG
    • Answer: Faithfulness (citation grounded), F1/Exact match for structured Qs
    • Toxicity/safety score; latency P95/P99.
  • Human eval: rubric with inter‑annotator agreement (Cohen’s κ).
  • Gate: changes must beat baseline with statistical significance (bootstrap CI).

Q2. LLM serving performance on A100 vs H100. What do you test and how?

Sample Answer:

  • Measure throughput (tokens/s), TTFT, P95 latency across batch sizes, sequence lengths, KV‑cache on/off.
  • Compare serving stacks: Triton Inference Server, TensorRT‑LLM, vLLM; test paged attention, FP8/FP16.
  • Warm‑up runs, fixed seeds; control for NUMA/PCIe topology; monitor SM occupancy and memory bandwidth (Nsight, nvidia-smi dmon).

Q3. Hallucination mitigation and measurement.

Sample Answer:

  • Techniques: retrieval grounding, constrained decoding (JSON schema), cite sources, self‑consistency.
  • Measurement: judge models with rubrics + evidence check; sample tasks from TruthfulQA‑like sets; human spot‑checks.
  • Production: uncertainty estimation (log‑prob/entropy), fallback to declarative refusal when low confidence.

Q4. Prompt‑injection & jailbreak test plan for an LLM using tool calls.

Sample Answer:

  • Threat model: prompt‑leakage, data exfil, policy bypass.
  • Tests: adversarial strings, nested instructions, unicode homoglyphs, tool schema fuzzing.
  • Defenses: strict function schemas, allow‑listing tools, output filtering, sandboxed execution, strip/escape model‑visible tool responses.
  • Telemetry: flag attempts, blocklists, red‑team corpus regression.

Q5. Fine‑tuning vs LoRA/Adapters vs Prompt‑tuning—how do you choose, and how do you QA?

Sample Answer:

  • Full FT: best capacity; costly; risk of catastrophic forgetting.
  • LoRA/Adapters: cost‑effective; rapid iteration; good for domain style.
  • Prompt‑tuning: cheapest; good for routing/specialization.
  • QA: same eval battery + drift checks vs base model; catastrophic forgetting tests on general tasks; ablations.

Q6. Avoiding evaluation leakage and ensuring statistical validity.

Sample Answer:

  • Version datasets; keep a sealed test set; prevent test‑time prompts from including answers.
  • Randomized trials; multiple seeds; paired tests; report CIs; A/B on shadow traffic with kill‑switch & rollback.

Q7. CI/CD for GenAI features. What gates do you enforce?

Sample Answer:

  • Offline eval on golden set must exceed threshold; toxicity/safety < budget; latency P95 within SLO.
  • Canary rollout with guardrails; continuous post‑deployment eval; model/version pinning; automatic rollback on SLO breach.

Q8. Grounded generation with structured outputs. How do you get reliable JSON?

Sample Answer:

  • Use function‑calling / schema‑guided decoding; temperature≈0–0.2; stop sequences; strict JSON schema validation with pydantic; repair loop limited to 1–2 attempts with schema diffs logged.

Q9. Data governance for training/eval data (PII, consent, lineage).

Sample Answer:

  • PII detection & redaction, access via least privilege; consent tracking; dataset versioning (DVC), provenance metadata.
  • Periodic re‑scrub; legal review for retention; data deletion pipeline.

Q10. You observe response instability across deployments. How do you stabilize?

Sample Answer:

  • Fix model + tokenizer hashes, same serving build, deterministic decoding settings; seed.
  • Normalize context windows; ensure identical system prompts; pin retrieval indices; beware clock/time‑based prompts.

Q11. Cost/perf optimization tests for LLM APIs vs self‑hosted.

Sample Answer:

  • Model TCO: GPU hours, energy, admin, latency; test batching, streaming, token echo toggles, cache hits.
  • Compare quality gates on the same golden set; choose hybrid: API for general, self‑hosted for sensitive/low‑latency paths.

Q12. Ethical guardrails: how do you institutionalize safety?

Sample Answer:

  • Policy library mapped to test prompts; red‑team rotations; incident review similar to postmortems.
  • Audit trail for prompts/responses; periodic drift checks; stakeholder sign‑off before promotions.

Section 3 — Automation Framework, Programming & Kubernetes (Senior)

Goals: Architecture decisions, scalable runners, typed Python with performance awareness, K8s multi‑tenant test grids, security, and SLO‑driven rollouts.


Q1. Design an enterprise automation platform used by multiple teams. Key components and contracts?

Sample Answer:

  • Core runner (idempotent, hermetic), plugin system (discovery via entry points), test metadata (owner, tags, flake budget, SLA), artifact service, data layer (results DB), selection engine (test impact analysis), observability (OTel tracing), policy gates.
  • Contracts: stable CLI/API, result schema, versioned plugins, RBAC.

Q2. Hermetic builds & cross‑platform consistency.

Sample Answer:

  • Use containerized toolchains, lockfiles (pip-tools/poetry), reproducible builds (SOURCE_DATE_EPOCH), Bazel or tox matrices; cache via remote cache.
  • Avoid host contamination: mount read‑only, no network phase for unit tests.

Q3. Python design: implement deterministic test sharding for K8s workers.

Sample Answer:

from pathlib import Path
from hashlib import sha256
from typing import Iterable, List
 
def list_tests(root: Path) -> List[str]:
    return sorted(str(p) for p in root.rglob("test_*.py"))
 
def shard_of(test_id: str, total: int) -> int:
    h = sha256(test_id.encode("utf-8")).hexdigest()
    return int(h, 16) % total
 
def assign_shard(tests: Iterable[str], shard_index: int, total: int) -> List[str]:
    return [t for t in tests if shard_of(t, total) == shard_index]
 
if __name__ == "__main__":
    tests = list_tests(Path("tests"))
    shard = int(os.environ.get("SHARD_INDEX", "0"))
    total = int(os.environ.get("SHARD_TOTAL", "1"))
    to_run = assign_shard(tests, shard, total)
    print("\n".join(to_run))
  • Deterministic (hash stable), balanced statistically, works across runners; store mapping as artifact for repro.

Q4. Make parallel tests reliable.

Sample Answer:

  • Shared state eliminated; per‑test temp dirs (tmp_path), unique resource names, randomized ports; timeouts & cancellation; eventual consistency waits.
  • Use async IO or process pools for true parallelism; ensure thread safety in clients.

Q5. Contract testing vs E2E—senior stance and rollout.

Sample Answer:

  • Consumer‑driven contracts reduce E2E volume; wire up provider verification in CI; schema evolution rules; backward compatibility window.
  • Keep thin E2E for cross‑service flows; measure defect catch by tier.

Q6. Observability in test runs (distributed tracing).

Sample Answer:

  • Propagate trace/span IDs from test to services; OTel exporters; correlation between CI job and application traces.
  • Alert on test SLO (latency, pass rate), annotate releases on dashboards.

Q7. K8s test grid architecture.

Sample Answer:

  • Use Jobs for test shards; StatefulSets for backing stores (e.g., test DB); nodeSelector/taints for GPU nodes; resource requests/limits tuned.
  • PriorityClasses for urgent paths; HPA/KEDA to scale by queue depth.

Q8. K8s security and secrets.

Sample Answer:

  • Minimal RBAC service accounts, PodSecurity (restricted), securityContext (runAsNonRoot, readOnlyRootFilesystem).
  • Secrets via External Secrets/vault; network policies default‑deny; image signing (cosign).

Q9. Ephemeral preview environments per PR.

Sample Answer:

  • Namespace per PR; declarative via Helm/Kustomize; seeded data; TTLAfterFinished for Jobs; GC on merge.
  • Synthetic monitors run post‑deploy and gate merge.

Q10. Scale strategy: 5,000 E2E tests take 3h. Cut to <45m without quality loss.

Sample Answer:

  • Push to contract/API tests; slice into suites by risk; parallelize with 100+ shards; cache heavy fixtures; test impact analysis to run only affected tests; service virtualization for flakiest deps; enforce flake budget.

Q11. Failure forensics for pods.

Sample Answer:

  • kubectl describe, kubectl logs --previous, ephemeral containers for debug, dump /proc, coredumps to PV; capture node diagnostics via DaemonSet.

Q12. Example K8s Job template for sharded tests.

Sample Answer (YAML):

apiVersion: batch/v1
kind: Job
metadata:
  name: test-shard-{{index}}
spec:
  backoffLimit: 0
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: runner
        image: registry.example.com/qa/runner@sha256:...
        env:
        - name: SHARD_INDEX
          value: "{{index}}"
        - name: SHARD_TOTAL
          value: "100"
        resources:
          requests:
            cpu: "1"
            memory: "2Gi"
          limits:
            cpu: "2"
            memory: "4Gi"
        securityContext:
          runAsNonRoot: true
          readOnlyRootFilesystem: true

Section 4 — General / Problem Solving (Senior)

Goals: Decision‑making under uncertainty, system thinking, leadership, and crisp communication.


Q1. Go/No‑Go call with incomplete perf data and known P2s. How do you decide?

Sample Answer:

  • Map to risk register: quantify blast radius, detectability, reversibility.
  • If rollback is fast & safe, allow canary with real‑time SLO monitors; otherwise No‑Go with explicit action plan.
  • Document assumptions, owner, and review date.

Q2. Post‑release outage RCA framework.

Sample Answer:

  • Timeline + evidence; 5 Whys + fishbone; identify control weaknesses (tests, reviews, feature flags).
  • CAPA: add checks, tests, alerts; assign accountable owners, due dates; communicate broadly.

Q3. Define senior QA metrics that matter.

Sample Answer:

  • Leading: PR test pass rate, flake rate, change failure probability, time to detect.
  • Lagging: escaped defects, customer‑found defects/time, availability SLOs.
  • Tie to business outcomes; prevent vanity metrics.

Q4. Debugging a suspected Python deadlock in an automation controller.

Sample Answer:

  • Enable faulthandler; send SIGQUIT to dump stacks; inspect threading.enumerate(); look for lock cycles.
  • Reproduce under PYTHONFAULTHANDLER=1, use py-spy or gdb for native waits; consider GIL vs I/O; move to multiprocessing or async.

Q5. Distributed system drops messages; duplicates appear. How to make tests robust?

Sample Answer:

  • Assume at‑least‑once delivery; enforce idempotent handlers; use dedupe keys; test with fault injection (retries, reorders).
  • Verification via eventual consistency with bounded waits + reconciliation checks.

Q6. Logical puzzle (pressure under time): You have 2 GPUs and limited lab time; how to maximize confidence in 10 permutations of OS×Driver×CUDA?

Sample Answer:

  • Prioritize with pairwise coverage and usage data; run smoke + perf canaries on all permutations; deep soak on top 2 risk combos; parallelize via staggered schedules and reuse caches.

Q7. Build vs Buy for a test data management tool.

Sample Answer:

  • TCO (initial + maintenance), time‑to‑value, compliance, extensibility, vendor lock‑in.
  • Pilot a SaaS with sensitive data excluded; build minimal plugin layer to avoid lock‑in.

Q8. Leading through disagreement—devs dispute your P1.

Sample Answer:

  • Bring evidence (repros, traces, user impact), align on definition of done/bug bar; propose experiment/canary to de‑risk; escalate only with facts and options.

Q9. Designing a self‑serve QA platform for the org.

Sample Answer:

  • Templates (service scaffolds), golden pipelines, shared runners, observability baked in, policy as code; office hours and docs; success measured by adoption and lead time.

Q10. Security‑minded QA additions.

Sample Answer:

  • Add SAST/DAST, secrets scanning, dependency scanning with SBOM, sigstore verification, container CIS benchmarks; threat modeling (STRIDE) at feature kickoff; integrate security tests into CI.

Q11. Communicating to execs—1‑page Go/No‑Go. What’s in it?

Sample Answer:

  • Context, what changed, risk & mitigations, SLO impact, test coverage gaps, recommendation with options (ship/canary/hold), owner & timeline.

Q12. Upleveling the team.

Sample Answer:

  • Define career ladders, mentoring, guilds, brown bags; quarterly goals (flake <0.5%, e2e cut 30%); publish scorecards; celebrate outcomes.

Want a printable guide?

I can package this into a single Markdown/PDF with a clean layout (sections, timing guide, checklists).

Would you like me to:

  • A) Generate a Markdown file you can download, or
  • B) Create a Google Doc/Word style export with a contents page and interview scoring rubrics?