Multicast on-chip communication is encountered in various cache-coherence protocols targeting multi-core processors, and its pervasiveness is increasing due to the proliferation of machine learning accelerators. In-network handling of multicast traffic imposes additional switching-level restrictions to guarantee deadlock freedom, while it stresses the allocation efficiency of Network-on-Chip (NoC) routers. In this work, we propose a novel NoC router microarchitecture, called SmartFork, which employs a versatile and cost-efficient multicast packet replication scheme that allows the design of high-throughput and low-cost NoCs. The design is adapted to the average branch splitting observed in real-world multicast routing algorithms. Compared to state-of-the-art NoC multicast approaches, SmartFork is demonstrated to yield higher performance in terms of latency and throughput, while still offering a cost-effective implementation.