Using Object Detection Model YOLOv7 with TensorFlow.js

1. Introduction

A few weeks ago, I was assigned a task at work involving object detection. As someone who primarily works on the frontend, I became curious: was it possible to implement an object detection model directly in a browser-based React app, without relying on a backend or Python-based inference?

This blog post is a continuation of that research. It documents the step-by-step process I took to run a YOLOv7 model using TensorFlow.js within a React project. Along the way, I encountered several technical challenges, particularly around model conversion and client-side rendering, that I believe are worth sharing.

My goal is to make this post useful for fellow developers who are exploring the same idea or simply want to integrate machine learning into their frontend applications. I'll walk you through everything from model conversion and preprocessing to inference and rendering the results in the browser.

Let’s get started.

2. What is YOLO and Why TensorFlow.js?

🧠 A Quick Overview of YOLO

YOLO (You Only Look Once) is a well-known family of real-time object detection models. It became popular for its ability to detect multiple objects in a single forward pass - making it fast and efficient for applications like surveillance, robotics, and real-time analytics.

Over time, YOLO has evolved into several versions maintained by different contributors:

  • YOLOv3 & YOLOv4: Older but still widely used, lightweight, and efficient
  • YOLOv5, v8, and v11: Developed and maintained by Ultralytics, offering better tooling and performance improvements
  • YOLOv7: Developed by WongKinYiu, widely appreciated for its balance of accuracy and speed, and considered one of the most stable and community-driven versions

⚖️ Why Licensing Matters (and Why You Should Care)

When working with open-source models, licensing is not just a legal formality - it determines how you can use, share, or deploy that model. And in many real-world cases, misuse of licenses (even unintentionally) can cause serious issues, especially in commercial settings.

Here's a brief overview:

(Image: license comparison across YOLO versions)

🔹 AGPLv3 (used by Ultralytics for YOLOv5+):

If you use this in a public-facing app, you're required to open-source your entire application, including any code that interacts with the model, even if you didn't modify the model itself.

🔹 YOLOv4:

Released under a custom license that explicitly restricts commercial use, which makes it risky to use in production unless you've obtained special permission.

🔹 YOLOv3 and YOLOv7:

These are safer choices for projects that may eventually be used commercially or shared publicly. YOLOv7, in particular, offers excellent performance without restrictive licensing.

🛑 Note: Always double-check the license of any model you use; don't treat open source as "free to use without conditions." It's better to be cautious than to deal with legal issues later on.

🌐 Why TensorFlow.js?

To run the model entirely in the browser, I used TensorFlow.js, a JavaScript library that brings machine learning to the web.

Why TensorFlow.js?

  • No backend or server needed
  • Seamless integration with React
  • GPU acceleration via WebGL
  • Ideal for building lightweight prototypes, interactive tools, and real-time demos

In this project, TensorFlow.js allowed me to take a fully trained YOLOv7 model, convert it, and run object detection directly in a React app - no Python, no API calls, no external inference servers.
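
As a quick illustration of the WebGL point, TensorFlow.js lets you request a backend explicitly and fall back if it is not available. This is a minimal sketch of my own (not code from the project); tf.setBackend(), tf.ready(), and tf.getBackend() are standard TensorFlow.js APIs.

import * as tf from '@tensorflow/tfjs'

// Prefer the WebGL backend for GPU acceleration; fall back to the CPU backend if it fails.
export const initTf = async () => {
  const webglOk = await tf.setBackend('webgl') // resolves to false if the backend cannot be set
  if (!webglOk) {
    await tf.setBackend('cpu')
  }
  await tf.ready() // wait until the selected backend is fully initialized
  console.log(`TensorFlow.js backend: ${tf.getBackend()}`)
}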

3. Converting YOLOv7 to TensorFlow.js

Most pre-trained YOLO models - like YOLOv7 - are built in PyTorch, which can't be used directly in the browser. To make it work with TensorFlow.js, we need to convert the model through several stages. Each step transforms the model into a format that gets us closer to running it in the browser.

Below is the step-by-step pipeline I used:

🔄 Conversion Flow

Conversion flow: PyTorch (.pt) → ONNX (.onnx) → TensorFlow SavedModel → TensorFlow.js graph model

Step 1: Get and Export YOLOv7 from PyTorch to ONNX

First, I used the official YOLOv7 export script to convert the .pt model file into the ONNX format.

Get the model from the official repository:
Official YOLOv7 repository by WongKinYiu

# Download trained weights
!wget https://github.com/WongKinYiu/yolov7/releases/download/v0.1/yolov7-tiny.pt

Export the YOLOv7 model to ONNX:

# For onnxruntime, --max-wh must be specified as an integer:
# 0 means class-agnostic NMS, any other value means non-agnostic NMS.
!python export.py --weights ./yolov7-tiny.pt \
        --grid --end2end --simplify \
        --topk-all 100 --iou-thres 0.65 --conf-thres 0.35 \
        --img-size 640 640 --max-wh 640

The result is a .onnx file containing the YOLOv7 model structure and weights.

📎 See the full notebook: here

Step 2: Convert ONNX to TensorFlow.js

Next, I converted the ONNX model into TensorFlow's SavedModel format using onnx2tf, and then converted the SavedModel into the TensorFlow.js graph model format using tensorflowjs_converter.

# Convert ONNX to TensorFlow SavedModel using onnx2tf
!python -m onnx2tf -i best2.onnx -ois input:1,3,640,640 -osd -dgc

# Convert SavedModel to TensorFlow.js (tfjs)
!tensorflowjs_converter \
    --input_format=tf_saved_model \
    --output_format=tfjs_graph_model \
    saved_model \
    tfjs_model

This creates two folders: saved_model (the intermediate TensorFlow model) and tfjs_model (the converted TensorFlow.js graph model).

📎 Reference: TFJS Converter Docs

4. Integrating the Model into ReactJS

With the model converted and ready to be used in the browser, the next step is integrating it into a React application. For this project, I used React with Vite, along with Tailwind CSS for UI, and @tensorflow/tfjs for inference.

Here’s a breakdown of how I structured the integration:

🧱 Project Setup

First, I initialized the project with Vite and installed necessary dependencies:

npm create vite@latest object-detection-yolo -- --template react-ts
cd object-detection-yolo
npm install

Then I installed TensorFlow.js:

npm install @tensorflow/tfjs

Optional: Tailwind CSS for styling

npm install -D tailwindcss postcss autoprefixer
npx tailwindcss init -p

⚙️ Loading and Preparing the Model in React

In the React application, I used a useEffect() hook to load the model and prepare it for inference as soon as the component mounts. This process includes downloading the model, warming it up, and storing relevant metadata in the component’s state.

useEffect(() => {
    tf.ready().then(async () => {
      const yolov7 = await tf.loadGraphModel(
        `${window.location.origin}/yolov7tiny_web_model/model.json`,
        {
          onProgress: (fractions) => {
            setLoading({ loading: true, progress: fractions }); // set loading fraction
          },
        }
      ); // load model

      if (!yolov7) return;

      // warm up the model so the first real inference is fast
      const dummyInput = tf.ones(yolov7.inputs[0].shape!);
      const warmupResults = await yolov7.executeAsync(dummyInput);

      setLoading({ loading: false, progress: 1 });
      setModel({
        net: yolov7,
        inputShape: yolov7.inputs[0].shape ?? [1, 0, 0, 3],
      }); // set model & input shape

      tf.dispose([warmupResults, dummyInput]); // cleanup memory
    });
  }, []);

Key Steps Explained:

  1. Model Loading: The model is loaded using tf.loadGraphModel() from the local public directory. The progress of the model loading is tracked using the onProgress callback to show a loading indicator on the UI.

  2. Warm-Up Step: Before performing any actual detection, the model is run once with a dummy input (tf.ones(...)) that matches its input shape. This “warms up” the model by initializing memory and caching computation graphs, which helps reduce lag on the first real inference.

  3. Set Model in State: Once the model is ready, it’s stored in the component’s state using setModel, along with its expected input shape. This makes the model available to other parts of the app for processing images or video.

  4. Memory Management: Temporary tensors used during the warm-up are disposed of using tf.dispose() to avoid memory leaks—especially important in browser-based apps where resources are limited.

This entire lifecycle setup ensures that the model is loaded efficiently and ready for real-time inference as soon as the user interacts with the app.
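
For reference, the snippet above relies on imports and two pieces of component state that are not shown. A minimal sketch of what they could look like is below; the exact names and types are my assumption, not taken verbatim from the project.

import { useEffect, useState } from 'react'
import * as tf from '@tensorflow/tfjs'

type ModelState = {
  net: tf.GraphModel<string | tf.io.IOHandler> | null
  inputShape: number[] // e.g. [1, 640, 640, 3]
}

export default function Detector() {
  // download / warm-up progress shown in the UI (progress is a fraction between 0 and 1)
  const [loading, setLoading] = useState({ loading: true, progress: 0 })
  // the loaded graph model and its expected input shape
  const [model, setModel] = useState<ModelState>({ net: null, inputShape: [1, 0, 0, 3] })

  useEffect(() => {
    // ...model loading and warm-up code from the snippet above goes here...
  }, [])

  return null // replace with the canvas / upload UI
}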

🎨 Preprocessing the Input

Before passing an image or video frame into the model, it needs to be preprocessed to match the model’s expected input format. In this case, the YOLOv7 model (converted to TensorFlow.js) expects an input shape of [1, 640, 640, 3], meaning a single RGB image with dimensions 640×640 pixels.

Here’s how the preprocess() function handles that:

const preprocess = (
  source:
    | tf.PixelData
    | ImageData
    | HTMLImageElement
    | HTMLCanvasElement
    | HTMLVideoElement
    | ImageBitmap,
  modelWidth: number,
  modelHeight: number,
) => {
  const { input, xRatio, yRatio } = tf.tidy(() => {
    const img = tf.browser.fromPixels(source)

    // padding image to square => [n, m] to [n, n], n > m
    const [h, w] = img.shape.slice(0, 2) // get source width and height
    const maxSize = Math.max(w, h) // get max size
    const imgPadded = img.pad([
      [0, maxSize - h], // padding y [bottom only]
      [0, maxSize - w], // padding x [right only]
      [0, 0],
    ]) as tf.Tensor<tf.Rank.R3>

    const xRatio = maxSize / w // update xRatio
    const yRatio = maxSize / h // update yRatio
    const input = tf.image
      .resizeBilinear(imgPadded, [modelWidth, modelHeight]) // resize frame
      .div(255.0) // normalize
      .expandDims(0) // add batch
    return {
      input: input,
      xRatio: xRatio,
      yRatio: yRatio,
    }
  })

  return { input, xRatio, yRatio }
}

What it does:

  1. Converts the input to a tensor: The image, canvas, or video frame is converted to a TensorFlow tensor using tf.browser.fromPixels().
  2. Pads the image to make it square: Since many real-world images are rectangular, the function calculates the larger of the two dimensions (height or width) and pads the shorter side so that the image becomes a square. This avoids distortion when resizing later.
  3. Calculates scale ratios: The original aspect ratio is preserved by storing the horizontal (xRatio) and vertical (yRatio) scaling factors. These will later be used to map bounding box coordinates back to the original image size.
  4. Resizes and normalizes the image: The square image is resized to the model’s expected dimensions (modelWidth × modelHeight), normalized to values between 0 and 1, and expanded to include the batch dimension.
  5. Memory-safe execution with tf.tidy(): The entire process is wrapped in tf.tidy() to automatically dispose of intermediate tensors and prevent memory leaks in the browser.

Output:
The function returns:

  • input: the preprocessed image tensor ready to be passed to the model
  • xRatio and yRatio: scaling factors to restore original coordinate positions later during post-processing

This preprocessing step ensures that any input image, video frame, or canvas can be fed into the model without shape mismatch errors, while also preserving spatial accuracy for rendering detection results.
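
As a quick usage sketch (the element id and surrounding variables are assumptions for illustration): once the model is loaded, you can preprocess any image element like this and inspect the resulting tensor before passing it to the network.

// assumes `model` holds the loaded network and input shape from the state shown earlier
const imageEl = document.getElementById('input-image') as HTMLImageElement
const [modelWidth, modelHeight] = model.inputShape.slice(1, 3)

const { input, xRatio, yRatio } = preprocess(imageEl, modelWidth, modelHeight)
console.log(input.shape) // [1, 640, 640, 3]
console.log(xRatio, yRatio) // ratios used later to map boxes back to the source image

tf.dispose(input) // dispose manually if the tensor is not passed straight to the model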

🔍 Running Inference and Rendering the Result

Once the input image is preprocessed, the next step is to run it through the model and render the detection results. This is handled by the detect2() function, which performs inference, processes the output, and visualizes the detected objects on a <canvas> element.

export const detect2 = async (
  source:
    | tf.PixelData
    | ImageData
    | HTMLImageElement
    | HTMLCanvasElement
    | HTMLVideoElement
    | ImageBitmap,
  model: { net: tf.GraphModel<string | tf.io.IOHandler>; inputShape: number[] },
  threshold: number,
  canvasRef: HTMLCanvasElement,
  callback = () => {},
) => {
  const [modelWidth, modelHeight] = model.inputShape.slice(1, 3) // get model width and height

  tf.engine().startScope() // start scoping tf engine
  const { input, xRatio, yRatio } = preprocess(source, modelWidth, modelHeight) // preprocess image

  const res = (await model.net.executeAsync(input)) as tf.Tensor<tf.Rank.R2> // inference model

  const dets = res.arraySync()

  renderBoxesSimple(canvasRef, dets, [xRatio, yRatio], threshold)

  tf.dispose([res]) // clear memory

  callback()

  tf.engine().endScope() // end of scoping
}

What the function does:

  1. Extracts model dimensions: The model’s expected input width and height are taken from its inputShape and passed to the preprocessing function.
  2. Starts a memory scope: tf.engine().startScope() is called to ensure that any tensors created within this block are tracked and can be cleaned up afterward. This is important for long-running apps like webcam feeds, where unmanaged memory usage can grow rapidly.
  3. Preprocesses the input: The input (image, video, canvas, etc.) is processed using the preprocess() function, which returns a normalized, padded, and resized tensor along with the scaling ratios needed to map detections back to the original image.
  4. Runs model inference: The preprocessed input is passed to executeAsync(), which returns a prediction tensor. This tensor contains the raw detection results: bounding boxes, class IDs, and confidence scores.
  5. Processes output and renders detections: The output tensor is converted to a JavaScript array with arraySync() and passed to a custom rendering function (renderBoxesSimple). This function draws the bounding boxes and labels directly onto the canvas using the correct scale and position.
  6. Cleans up memory: After inference is complete, the result tensor is disposed using tf.dispose(), and the scope is ended with tf.engine().endScope(), ensuring all temporary tensors are released.
  7. Executes optional callback: A callback can be provided to trigger any additional logic after the detection is complete (e.g., logging, UI updates, analytics).

Summary:
This function acts as the main detection loop. It takes an image, processes it, feeds it into the model, and then displays the result, all within a memory-safe scope. It's designed to be reused in real-time pipelines, like webcam-based detection systems or image upload flows.
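
Here is a usage sketch showing how detect2() might be wired to an image element (the element ids and the 35% threshold are assumptions for illustration; the threshold is a percentage, matching the threshold / 100 comparison used during rendering):

// run detection once the chosen image has loaded
const canvas = document.getElementById('output-canvas') as HTMLCanvasElement
const image = document.getElementById('input-image') as HTMLImageElement

image.onload = () => {
  if (!model.net) return // model is still loading

  // size the canvas overlay before drawing boxes on it
  canvas.width = image.width
  canvas.height = image.height

  detect2(image, { net: model.net, inputShape: model.inputShape }, 35, canvas, () => {
    console.log('detection finished')
  })
}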

🖍️ Rendering the Bounding Boxes and Labels on Canvas

Once the model produces detection results, the final step is to visualize them. This is handled by the renderBoxesSimple() function, which draws bounding boxes and corresponding class labels onto an HTML <canvas>.

export const renderBoxesSimple = (
  canvasRef: HTMLCanvasElement,
  boxes_data: number[][],
  ratios: number[],
  threshold: number,
) => {
  const ctx = canvasRef.getContext('2d')
  if (!ctx) return
  ctx.clearRect(0, 0, ctx.canvas.width, ctx.canvas.height) // clean canvas

  // font configs
  const font = `${Math.max(
    Math.round(Math.max(ctx.canvas.width, ctx.canvas.height) / 40),
    14,
  )}px Arial`
  ctx.font = font
  ctx.textBaseline = 'top'

  boxes_data.forEach((det) => {
    // eslint-disable-next-line @typescript-eslint/no-unused-vars
    const [_, x0, y0, x1, y1, cls_id, score] = det

    if (score < threshold / 100) return
    const [xRatio, yRatio] = ratios
    // Convert coordinates back to the original image size
    const origX0 = x0 * xRatio
    const origY0 = y0 * yRatio
    const origX1 = x1 * xRatio
    const origY1 = y1 * yRatio
    const colors = new Colors()
    const color = colors.get(cls_id)
    // Draw the translucent background fill
    ctx.fillStyle = Colors.hexToRgba(color, 0.2)!
    ctx.fillRect(origX0, origY0, origX1 - origX0, origY1 - origY0)
    // Draw the bounding box
    ctx.strokeStyle = color
    ctx.lineWidth = 2
    ctx.strokeRect(origX0, origY0, origX1 - origX0, origY1 - origY0)

    // Draw the label background.
    ctx.fillStyle = color
    const text = `${labels[cls_id]}: ${(score * 100).toFixed(2)}%`
    const textWidth = ctx.measureText(text).width
    const textHeight = parseInt(font, 10) // base 10
    const yText = origY0 - (textHeight + ctx.lineWidth)
    ctx.fillRect(
      origX0 - 1,
      yText < 0 ? 0 : yText, // handle overflow label box
      textWidth + ctx.lineWidth,
      textHeight + ctx.lineWidth,
    )

    // Draw labels
    ctx.fillStyle = '#ffffff'
    ctx.fillText(text, origX0 - 1, yText < 0 ? 0 : yText)
  })
}

What the function does:

Prepares the canvas

  • It starts by getting the canvas rendering context (ctx) and clearing any existing drawings using clearRect().
  • The font size is set dynamically based on the canvas size to ensure label text scales appropriately.

Iterates through detection results

  • For each detection in boxes_data, the function extracts the bounding box coordinates (x0, y0, x1, y1), class ID (cls_id), and confidence score (score).
  • If the confidence score is below the defined threshold, the detection is skipped.

Scales bounding boxes

  • Coordinates are rescaled back to the original image dimensions using the xRatio and yRatio values obtained during preprocessing (for example, for a 1280×720 source, maxSize is 1280, so xRatio = 1280/1280 = 1.0 and yRatio = 1280/720 ≈ 1.78).

Draws bounding boxes and background

  • A semi-transparent background is drawn to highlight the detected object.
  • A colored border (stroke) is rendered around the object using a consistent color assigned to the class ID.

Adds labels

  • A solid background is drawn behind the label text to improve readability.
  • The label includes the class name and confidence score, and is positioned just above the bounding box.
  • White text (#ffffff) is used for high contrast.

Color management

  • The function uses a helper class Colors() (not shown here) to assign consistent, visually distinct colors for each class.

Example output:

  • A green box around a person with the label person: 94.23%
  • A blue box around a car with the label car: 88.17%

This function ensures that detection results are not just computed, but clearly and professionally visualized — making it useful for demos, prototypes, and real-time visual feedback in the browser.
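
Since the Colors helper is not shown in this post, here is a minimal sketch of what such a class could look like: a fixed palette indexed by class id plus a hex-to-rgba converter. This is my own illustration of the idea, not the exact implementation from the repo.

export class Colors {
  // small fixed palette; class ids wrap around it so each class always gets the same color
  private palette = ['#FF3838', '#FF701F', '#FFB21D', '#CFD231', '#48F90A', '#2C99A8', '#00C2FF', '#344593']

  get(classId: number): string {
    return this.palette[Math.floor(classId) % this.palette.length]
  }

  // convert '#RRGGBB' to 'rgba(r, g, b, alpha)' for the translucent box fill
  static hexToRgba(hex: string, alpha: number): string | null {
    const match = /^#?([0-9a-f]{6})$/i.exec(hex)
    if (!match) return null
    const value = parseInt(match[1], 16)
    const r = (value >> 16) & 255
    const g = (value >> 8) & 255
    const b = value & 255
    return `rgba(${r}, ${g}, ${b}, ${alpha})`
  }
}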

5. Results and Performance

After integrating everything, I was able to run real-time object detection entirely in the browser using a React app — no server-side processing, no backend API, and no Python code involved at runtime.

✅ What Worked Well

  • Client-side inference with TensorFlow.js worked surprisingly well for images and short video clips.
  • Bounding boxes and labels rendered cleanly on top of a canvas element, with consistent performance.
  • Warm-up step noticeably improved initial response time, avoiding delays on first detection.
  • The model ran on WebGL acceleration, making it fairly efficient even on mid-range laptops.

🖼️ Visual Output

(Images: example detection results rendered on the canvas)

I tested the system on a variety of images with multiple objects. The model was able to:

  • Detect and classify multiple objects with reasonable accuracy
  • Adjust bounding boxes according to the original image ratio
  • Display real-time updates when used with webcam or video input

If you’re curious to try it yourself:
🚀 Live Demo

⚠️ Limitations and Considerations

As with any frontend-only machine learning project, there are trade-offs:

  • Browser memory usage can spike, especially with large input images or repeated inference
  • Model size and load time: The TFJS model (~30–50MB) can take a few seconds to download depending on connection
  • Performance varies: On mobile or low-end devices, detection can lag or cause dropped frames
  • Output format from YOLOv7 required some adjustment to interpret correctly in TensorFlow.js (see the sketch just after this list)
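
For reference, with the --end2end export used earlier, non-maximum suppression is baked into the graph and each row of the output tensor is already a final detection. The layout below is my reading of the destructuring in renderBoxesSimple(); double-check it against your own export before relying on it.

// one row per detection, as destructured in renderBoxesSimple():
// [batchId, x0, y0, x1, y1, clsId, score]
//   batchId - index of the image in the batch (always 0 here, since the batch size is 1)
//   x0..y1  - box corners, later scaled for display with xRatio / yRatio
//   clsId   - index into the labels array
//   score   - confidence in the 0..1 range
const exampleDetection: number[] = [0, 112.5, 87.0, 334.2, 401.8, 0, 0.94]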

That said, for prototyping, learning, and lightweight client-side ML applications — this approach works surprisingly well.

6. Final Thoughts

This project started as part of a task at work, but it quickly grew into a deeper exploration of what’s possible with machine learning on the frontend. Running an object detection model like YOLOv7 directly in a browser — without any backend — might not be the most common approach, but it’s a powerful proof-of-concept that opens up a lot of possibilities.

Along the way, I faced several challenges — from converting the model across formats to adapting the output for frontend rendering. But those obstacles were exactly what made this process meaningful — and now, I hope, useful for others too.

If you’re a frontend developer curious about AI, or someone working on rapid prototyping with limited backend infrastructure, I hope this guide provides both inspiration and practical guidance.

🔗 Resources Recap

GitHub Repo: github.com/ihda06/object-detection-yolo
Live Demo: object-detection-yolo-ihda.vercel.app

Model Conversion (Colab):

If you found this helpful, feel free to share it or fork the repo.
And if you’re working on something similar — I’d love to connect, collaborate, or just chat.

Thanks for reading 🙌
