Creating a 3D Hand Controller Using a Webcam with MediaPipe and Three.js

I love interactive things, but I’m not a fan of expensive devices, especially those we have to buy year after year to have the latest hardware in our hands!

I’m a fan of accessible technology!

With that in mind, today I’m going to show you how to control elements in a 3D scene using just your hands, a webcam, and a web browser. The key focus here is converting a 2D screen into a 3D space, with full depth control. I’ll keep the focus on that!

By the way, here you can try a nice implementation of this controller in an Endless Spaceship Game.

Setup

First, we need to create our @mediapipe/hands instance to detect the hand landmarks.

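import { Hands } from "@mediapipe/hands";
import { Camera } from "@mediapipe/camera_utils";

export class MediaPipeHands {
  constructor(videoElement, onResultsCallback) {
    // Load the model files from the CDN.
    const hands = new Hands({
      locateFile: (file) => {
        return `https://cdn.jsdelivr.net/npm/@mediapipe/hands/${file}`;
      },
    });
    hands.setOptions({
      maxNumHands: 1, // we only track one hand
      modelComplexity: 1,
      minDetectionConfidence: 0.5,
      minTrackingConfidence: 0.5,
    });
    // Called with the detection results for every processed frame.
    hands.onResults(onResultsCallback);
    // Feed each webcam frame to the hand tracker.
    this.camera = new Camera(videoElement, {
      async onFrame() {
        await hands.send({ image: videoElement });
      },
      width: 1280,
      height: 720,
    });
  }
  // Start the webcam and begin sending frames.
  start() {
    if (this.camera) this.camera.start();
  }
}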

I have created a class called MediaPipeHands that contains this implementation, so we can easily add it to the code and reuse it anywhere.

Like this:

this.mediaPipeHands = new MediaPipeHands(videoElement, (landmarks) =>
  this.onMediaPipeHandsResults(landmarks)
);
this.mediaPipeHands.start();

onMediaPipeHandsResults is the callback we registered through hands.onResults. The library calls it with the detected landmarks for each processed frame.
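
As a minimal sketch, the callback can check that a hand was actually found before touching the landmarks. The helper name `getFirstHand` is hypothetical, and treating a missing or empty `multiHandLandmarks` field as "no hand" is a defensive assumption on my side, not documented behavior:

```javascript
// Hypothetical helper: returns the 21 landmarks of the first detected hand,
// or null when the frame contains no hand.
function getFirstHand(results) {
  // Assumption: the field may be missing or empty when nothing is detected.
  const hands = results && results.multiHandLandmarks;
  if (!Array.isArray(hands) || hands.length === 0) return null;
  return hands[0];
}

// Usage with hand-made result objects (not real MediaPipe output).
const fakeHand = Array.from({ length: 21 }, () => ({ x: 0.5, y: 0.5, z: 0 }));
console.log(getFirstHand({ multiHandLandmarks: [fakeHand] }).length); // 21
console.log(getFirstHand({})); // null
```

With a guard like this, the rest of the callback can assume it always receives a full set of 21 points.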

Now, let’s create our WebGL scene with Three.js, just so we have some elements to interact with.
Nothing special, a simple Three.js scene 😉

Here’s another class for rapid prototyping, a plug-and-play static class that can be accessed anywhere in your code. Keep in mind that it doesn’t follow any standard; it’s just there to make my life easier.

import {
  Scene,
  PerspectiveCamera,
  Color,
  WebGLRenderer,
  AmbientLight,
  DirectionalLight,
  SpotLight,
  Clock,
} from "three";
export class ScenesManager {
  static scene;
  static camera;
  static renderer;
  static clock;
  static setup() {
    ScenesManager.scene = new Scene();
    ScenesManager.scene.background = new Color(0xcccccc);
    ScenesManager.camera = new PerspectiveCamera(
      45,
      window.innerWidth / window.innerHeight,
      0.01,
      100
    );
    ScenesManager.camera.position.set(0, 0, 2);
    ScenesManager.clock = new Clock();
    ScenesManager.renderer = new WebGLRenderer({ antialias: true });
    ScenesManager.renderer.setSize(window.innerWidth, window.innerHeight);
    ScenesManager.renderer.setPixelRatio(window.devicePixelRatio);
    ScenesManager.renderer.shadowMap.enabled = true;
    const ambLight = new AmbientLight(0xffffff, 1);
    ScenesManager.scene.add(ambLight);
    document.body.appendChild(ScenesManager.renderer.domElement);
  }
  static render() {
    ScenesManager.renderer.render(ScenesManager.scene, ScenesManager.camera);
  }
}

Just import it and initialize it to get a scene:

import { ScenesManager } from "./ScenesManager.js";
// ...
ScenesManager.setup();

Creating the controls

Let’s start using our landmarks. They are the Cartesian coordinates returned by @mediapipe/hands, as mentioned above, with x and y normalized to the 0 to 1 range of the webcam image.

Here is the reference so you know what I’m talking about and understand the values we will use.

To move our elements along the x- and y-axes I’ll use only one point, 9. MIDDLE_FINGER_MCP. I chose it purely for its position: it sits at the center of the hand, so it makes sense as the anchor.
At least for me 😅

We are using only one hand in this exploration, so only the 21 landmarks of the first hand (multiHandLandmarks[0]) matter.
If you want to use two hands, raise maxNumHands to 2 and read the second set of 21 landmarks from the same results (multiHandLandmarks[1]).

if (landmarks.multiHandLandmarks.length === 1) {
    const hand = landmarks.multiHandLandmarks[0];
    for (let l = 0; l < 21; l++) {
       // Center each point on the origin and flip x/y so the points mirror the hand.
       this.handsObj.children[l].position.x = -hand[l].x + 0.5;
       this.handsObj.children[l].position.y = -hand[l].y + 0.5;
       this.handsObj.children[l].position.z = -hand[l].z;
    }
}
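
To make the conversion explicit, here is a small sketch of the same mapping as a pure function. The name `landmarkToScene` is hypothetical. Because x and y arrive normalized to the 0 to 1 range of the image, subtracting them from 0.5 centers the hand on the origin and flips both axes: x is mirrored like a selfie view, and the image’s downward y becomes the scene’s upward y.

```javascript
// Hypothetical helper: maps one normalized landmark to scene coordinates.
function landmarkToScene(landmark) {
  return {
    x: 0.5 - landmark.x, // mirror horizontally and center on 0
    y: 0.5 - landmark.y, // image y grows downward, scene y grows upward
    z: -landmark.z,
  };
}

// A landmark at the center of the image lands on the origin.
const center = landmarkToScene({ x: 0.5, y: 0.5, z: 0 });
console.log(center.x, center.y); // 0 0

// The top-left corner of the image ends up at the top-right of the scene.
const topLeft = landmarkToScene({ x: 0, y: 0, z: 0 });
console.log(topLeft.x, topLeft.y); // 0.5 0.5
```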

This is the result:

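As you can see, the landmark points keep the same scale wherever the hand is in 3D space. That is because @mediapipe/hands has no proper z-axis: the tracking is essentially 2D, and the z value it reports is only relative to the wrist, not a distance from the camera. So here the magic starts to happen! ⭐️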

Yes, you are right, it is not magic, but I like this term! 😅

Z-Depth: 2D to 3D conversion

My idea is to get 2 landmark points, calculate the distance between them in 2D space, then apply it as depth.

As you can see in the video above, the cursor (the big sphere) already moves in the correct direction, opposite to the landmarks.

To make this happen I selected points 0. WRIST and 10. MIDDLE_FINGER_PIP from the landmarks.

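// Project the first depth landmark into 2D.
this.refObjFrom.position.copy(this.gestureCompute.depthFrom);
const depthA = this.to2D(this.refObjFrom);
this.depthPointA.set(depthA.x, depthA.y);
// Project the second depth landmark the same way.
this.refObjTo.position.copy(this.gestureCompute.depthTo);
const depthB = this.to2D(this.refObjTo);
this.depthPointB.set(depthB.x, depthB.y);
// The 2D distance between the two points grows as the hand approaches the webcam.
const depthDistance = this.depthPointA.distanceTo(this.depthPointB);
// Map the distance (0 to 1000) to a depth (-3 to 5), then clamp it to [-2, 4].
this.depthZ = THREE.MathUtils.clamp(
   THREE.MathUtils.mapLinear(depthDistance, 0, 1000, -3, 5), 
   -2, 
   4
);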
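
The `to2D` helper is not shown in this article. A common way to write such a helper in Three.js is to project the object’s world position into normalized device coordinates (for example with `Vector3.project(camera)`) and then convert those to pixels. The sketch below covers only that second step in plain JavaScript; `ndcToScreen` and its parameters are hypothetical names, and the real helper may well differ.

```javascript
// Hypothetical second half of a to2D-style helper.
// ndc.x and ndc.y are expected in [-1, 1], with +y pointing up.
function ndcToScreen(ndc, width, height) {
  return {
    x: ((ndc.x + 1) / 2) * width,
    y: ((1 - ndc.y) / 2) * height, // screen y grows downward
  };
}

// The middle of the view maps to the middle of a 1280x720 canvas.
const screenCenter = ndcToScreen({ x: 0, y: 0 }, 1280, 720);
console.log(screenCenter.x, screenCenter.y); // 640 360

// The upper-left corner of the view maps to pixel (0, 0).
const screenCorner = ndcToScreen({ x: -1, y: 1 }, 1280, 720);
console.log(screenCorner.x, screenCorner.y); // 0 0
```

Working in pixels would explain why the distance is mapped from a 0 to 1000 range in the snippet above.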

I clamp this value between -2 and 4 because that range feels comfortable to me, but it is not required. Tune it to whatever feels right to you as a user.

To move in the correct direction, we need to negate this value, so that when the hand is near the webcam the cursor moves away from the scene camera.
Like this:

this.target.position.set(
  this.gestureCompute.from.x,
  this.gestureCompute.from.y,
  -this.depthZ
);
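
To see how the numbers behave, here is a dependency-free sketch of the mapping and the sign flip. `mapLinear` and `clamp` re-implement the formulas behind the `THREE.MathUtils` functions of the same name, and `depthFromDistance` is a hypothetical wrapper that returns the final z position of the cursor.

```javascript
// Linear remap of x from the range [a1, a2] to the range [b1, b2].
function mapLinear(x, a1, a2, b1, b2) {
  return b1 + ((x - a1) * (b2 - b1)) / (a2 - a1);
}

// Restrict value to the range [min, max].
function clamp(value, min, max) {
  return Math.max(min, Math.min(max, value));
}

// Hypothetical wrapper: 2D distance between the two landmarks -> cursor z.
function depthFromDistance(distance) {
  const depthZ = clamp(mapLinear(distance, 0, 1000, -3, 5), -2, 4);
  return -depthZ; // negate so a closer hand pushes the cursor away
}

console.log(depthFromDistance(0)); // 2 (hand far away, cursor near the camera)
console.log(depthFromDistance(500)); // -1
console.log(depthFromDistance(1000)); // -4 (hand close, cursor far away)
```

Note that the clamp is what turns the raw extremes of -3 and 5 into -2 and 4 before the sign flip.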

Gestures

I used the same logic to detect the closed_fist gesture.
Since it is the only gesture we need to grab an object, we don’t need to import another dependency like GestureRecognizer, which saves load time and memory.

So I took 9. MIDDLE_FINGER_MCP and 12. MIDDLE_FINGER_TIP and applied the same approach: the distance between them tells me whether my hand is closed or not!

// Knuckle of the middle finger, centered, mirrored and scaled up.
this.gestureCompute.from.set(
   -landmarks.multiHandLandmarks[0][9].x + 0.5,
   -landmarks.multiHandLandmarks[0][9].y + 0.5,
   -landmarks.multiHandLandmarks[0][9].z
).multiplyScalar(4);
// Tip of the middle finger, converted the same way.
this.gestureCompute.to.set(
   -landmarks.multiHandLandmarks[0][12].x + 0.5,
   -landmarks.multiHandLandmarks[0][12].y + 0.5,
   -landmarks.multiHandLandmarks[0][12].z
).multiplyScalar(4);
const pointsDist = this.gestureCompute.from.distanceTo(
   this.gestureCompute.to
);
// A short knuckle-to-tip distance means the fingers are folded in.
this.closedFist = pointsDist < 0.35;
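
The same check can be sketched as a standalone function and tried on synthetic hands. The names `toScaledPoint`, `isClosedFist` and `makeHand` are hypothetical, and the landmark values below are made up for illustration rather than captured from the tracker.

```javascript
const FIST_THRESHOLD = 0.35; // same threshold as above
const SCALE = 4; // same scale factor as above

// Center, mirror and scale one normalized landmark.
function toScaledPoint(landmark) {
  return {
    x: (0.5 - landmark.x) * SCALE,
    y: (0.5 - landmark.y) * SCALE,
    z: -landmark.z * SCALE,
  };
}

// True when the middle fingertip (12) is close to its knuckle (9).
function isClosedFist(hand) {
  const from = toScaledPoint(hand[9]);
  const to = toScaledPoint(hand[12]);
  const dist = Math.hypot(from.x - to.x, from.y - to.y, from.z - to.z);
  return dist < FIST_THRESHOLD;
}

// Synthetic hand: every point at the image center except the fingertip.
function makeHand(tipY) {
  const hand = Array.from({ length: 21 }, () => ({ x: 0.5, y: 0.5, z: 0 }));
  hand[12] = { x: 0.5, y: tipY, z: 0 };
  return hand;
}

console.log(isClosedFist(makeHand(0.2))); // false (tip far above the knuckle)
console.log(isClosedFist(makeHand(0.45))); // true (tip folded next to the knuckle)
```

Because the distance is measured in image-relative units, the threshold may need adjusting when the hand is very close to or very far from the webcam.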

Collision Test

Now we need to make it more fun (for me, all of this is fun lol), so let’s detect collisions between our cursor and the objects and start grabbing them.

I wrote a simple AABB test because, in my testing, it was more accurate and performant than using a raycaster.
But remember, every case is different, so another algorithm, or even a Raycaster, may perform better for you!

this.targetBox3.setFromObject(this.target); // Target is my cursor
this.objects.forEach((obj) => {
  this.objectBox3.setFromObject(obj);
  const targetCollision = this.targetBox3.intersectsBox(this.objectBox3);
  if (targetCollision) {
    // Do something...I did a drag and drop interaction.
  }
});
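
For reference, an AABB test boils down to a per-axis overlap check. Here is a plain-JavaScript sketch with hypothetical names (`boxesIntersect`, `boxAround`) and boxes stored as `{ min, max }` pairs, which mirrors the shape of a Three.js `Box3`:

```javascript
// Two axis-aligned boxes intersect only if their ranges overlap on x, y and z.
function boxesIntersect(a, b) {
  return (
    a.min.x <= b.max.x && a.max.x >= b.min.x &&
    a.min.y <= b.max.y && a.max.y >= b.min.y &&
    a.min.z <= b.max.z && a.max.z >= b.min.z
  );
}

// Hypothetical helper: a cube of half-size `half` centered on `center`.
function boxAround(center, half) {
  return {
    min: { x: center.x - half, y: center.y - half, z: center.z - half },
    max: { x: center.x + half, y: center.y + half, z: center.z + half },
  };
}

const cursorBox = boxAround({ x: 0, y: 0, z: 0 }, 0.1);
const nearCube = boxAround({ x: 0.15, y: 0, z: 0 }, 0.1);
const farCube = boxAround({ x: 1, y: 0, z: 0 }, 0.1);

console.log(boxesIntersect(cursorBox, nearCube)); // true
console.log(boxesIntersect(cursorBox, farCube)); // false
```

The check is only a handful of comparisons per object, which is why it stays cheap even when it runs for every object on every frame.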

So that’s our result:

Check out the final demo.

You can preview it and browse through the code here.

Final words

I hope you enjoy it as much as I did! It was a lot of fun and enriching for me.

We finally have good hand tracking in the browser, something that wasn’t possible for a long time. I’m not talking about the technology itself, which has been around for years, but about the performance, which wasn’t nearly as good as it is today!

With this, we have a universe of possibilities to explore and the chance to create interesting experiences for any audience. I believe in and defend technology for everyone.

Thank you very much for reading this and see you next time!

Source: tympanus.net
