With the advent of mobile devices, the experience sampling method (ESM) is increasingly used as a convenient and effective way to capture user behaviors and to evaluate mobile and environment-context-dependent applications. Like any field-based in situ testing method, ESM is prone to biases from unreliable and unbalanced data, especially in A/B testing situations. Mitigating such effects can in turn incur significant costs in terms of the number of participants and sessions, and prolonged experimental time. In fact, ESM has rarely been applied to A/B testing, nor does the existing literature reveal its operational details and difficulties. In this paper, as a step toward establishing concrete guidelines, we describe a case study of applying ESM to the evaluation of two competing interfaces for a mobile application. Based on the gathered data and direct interviews with the participants, we highlight the difficulties experienced and the lessons learned. In addition, we propose a new ESM in which the experimental parameters are dynamically reconfigured based on intermediate experimental results to overcome the aforementioned difficulties.