This paper presents a real-time framework for computationally tracking objects visually attended by the user while navigating in interactive virtual environments. In addition to the conventional bottom-up (stimulus-driven) features, the framework also uses topdown (goal-directed) contexts to predict the human gaze. The framework first builds feature maps using preattentive features such as luminance, hue, depth, size, and motion. The feature maps are then integrated into a single saliency map using the center-surround difference operation. This pixel-level bottom-up saliency map is converted to an object-level saliency map using the item buffer. Finally, the top-down contexts are inferred from the user's spatial and temporal behaviors during interactive navigation and used to select the most plausibly attended object among candidates produced in the object saliency map. The computational framework was implemented using the GPU and exhibited extremely fast computing performance (5.68 msec for a 256X256 saliency map), substantiating its adequacy for interactive virtual environments. A user experiment was also conducted to evaluate the prediction accuracy of the visual attention tracking framework with respect to actual human gaze data. The attained accuracy level was well supported by the theory of human cognition for visually identifying a single and multiple attentive targets, especially due to the addition of top-down contextual information. The framework can be effectively used for perceptually based rendering without employing an expensive eye tracker, such as providing the depth-of-field effects and managing the level-of-detail in virtual environments.