
Question: how to deduplicate real-time data consumed from Kafka before writing it to storage #18

Open
hemaGitHub opened this issue Mar 19, 2018 · 2 comments

Comments

@hemaGitHub

The scenario: your program consumes from Kafka by topic, and the data comes from Logstash, which collects the data and logs produced by the various business teams.
The requirement: after consuming from Kafka, and before writing to storage, your program must deduplicate the data. How would you do that?
Any ideas would be appreciated, thanks.

@dearshor

What is your criterion for deciding that two records are duplicates? Assuming there is a clear criterion, and assuming you use Java: wrap your data and logs in instances of a class that sensibly overrides equals (and hashCode), add those instances to a HashSet, and you get deduplication, quick and dirty 😁 In a concurrent environment you can use a concurrent Set variant such as ConcurrentSkipListSet.
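
A minimal sketch of that approach, assuming the dedup key is a (source, messageId) pair carried by each log event; the field names here are illustrative, not part of the original question:

```java
import java.util.HashSet;
import java.util.Objects;
import java.util.Set;

// A log event whose identity is defined only by the fields used for dedup.
// The field names (source, messageId) are hypothetical.
final class LogEvent {
    final String source;
    final String messageId;
    final String payload;

    LogEvent(String source, String messageId, String payload) {
        this.source = source;
        this.messageId = messageId;
        this.payload = payload;
    }

    // Two events are duplicates when source and messageId match;
    // payload is deliberately excluded from the identity.
    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof LogEvent)) return false;
        LogEvent other = (LogEvent) o;
        return source.equals(other.source) && messageId.equals(other.messageId);
    }

    // hashCode must be consistent with equals for HashSet to work.
    @Override
    public int hashCode() {
        return Objects.hash(source, messageId);
    }
}

class Deduper {
    private final Set<LogEvent> seen = new HashSet<>();

    // Returns true the first time an event is seen, false for duplicates.
    boolean firstTime(LogEvent e) {
        return seen.add(e);
    }
}
```

One caveat on the concurrent suggestion: ConcurrentSkipListSet decides membership via compareTo rather than equals/hashCode, so if duplicates are defined by equals, a concurrent set that keeps those semantics is ConcurrentHashMap.newKeySet(). Either way, an in-memory set grows without bound unless you limit how long keys are remembered.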

@zhengzhuangjie

How long is the deduplication window supposed to be? You could add a caching layer between Logstash and Kafka, or run the dedup step after the data has been consumed from Kafka. A sketch of the windowed idea follows below.
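
To bound memory, one way to implement that dedup window is to remember each key only for a fixed period; keys reappearing inside the window count as duplicates. A rough sketch, where the window length and string keys are assumptions for illustration:

```java
import java.util.Iterator;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Remembers each key for a fixed window; an event whose key reappears
// within the window is treated as a duplicate. In a concurrent setting
// the put-then-compare below is only approximately exact, which is
// usually acceptable for log dedup.
class WindowedDeduper {
    private final long windowMillis;
    private final Map<String, Long> lastSeen = new ConcurrentHashMap<>();

    WindowedDeduper(long windowMillis) {
        this.windowMillis = windowMillis;
    }

    // Returns true if the key has not been seen within the window.
    boolean firstTimeInWindow(String key, long nowMillis) {
        Long prev = lastSeen.put(key, nowMillis);
        return prev == null || nowMillis - prev > windowMillis;
    }

    // Call periodically to drop expired keys and bound memory use.
    void evictExpired(long nowMillis) {
        Iterator<Map.Entry<String, Long>> it = lastSeen.entrySet().iterator();
        while (it.hasNext()) {
            if (nowMillis - it.next().getValue() > windowMillis) {
                it.remove();
            }
        }
    }
}
```

If the dedup state must be shared across several consumer instances, the "caching layer" mentioned above is often an external store such as Redis, where SET key value NX EX seconds gives the same per-key window semantics without keeping state in the consumer process.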
