Cough, cough. The sink is cool. Human mods are cool. Can't live without them.
But archiving is... well... you know... kind of a shitty thing to do. It's funny at times, brings up memories, sometimes gems, but...
Expert systems are used in the industry. They are a questionable form of AI, but an effective backbone for very big companies.
They're based on the concept of a "knowledge base".
A facts base which says "Threads with DG are cool. Threads with more than 30 posts are hot. Threads with ..."
And rules, like "If forum == Photoshop, archive to Photoshop archive. If thread is locked, check contents. If content is vulgar, send to oblivion".
All the expertise of an.. expert, that is.
Such a system could be built easily: 90% of the job is agreeing on what the expert should know and do, and afterwards, the expert system can question real mods... and LEARN based on their answers.
"Thread is more than 30 posts, not locked but contains profanity. Is it cool?"
...
So it could be fool-proofed for a year or so, and could actually effectively reduce the archival workload, and
lead to -less-redundant-info.
..eg. the purpose of the Wiqi.
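Just to make the facts-and-rules idea concrete, here is a tiny sketch of what such a knowledge base could look like (every fact, threshold and action below is invented for illustration, it's the shape that matters):
code:
# A rough sketch of a knowledge base: facts about one thread,
# plus rules mapping conditions on those facts to actions.
# All names, thresholds and actions are made up for illustration.
facts = {
    "forum": "Photoshop",
    "post_count": 42,
    "locked": False,
    "contains_profanity": False,
}

# Each rule pairs a condition on the facts with an action.
rules = [
    (lambda f: f["locked"] and f["contains_profanity"], "send to oblivion"),
    (lambda f: f["forum"] == "Photoshop", "archive to Photoshop archive"),
    (lambda f: f["post_count"] > 30, "flag as hot"),
]

for condition, action in rules:
    if condition(facts):
        print(action)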
....................................................................
This proposal is coffee-powered, meaning it is a 100% peaceful piece of advice, and you can break my bones or sing a love song, I'll just keep drinking me coffee while reading.
Coffee is good.
Tyberius Prime
Maniac (V) Mad Scientist with Finglongers
From: Germany Insane since: Sep 2001
posted 04-21-2006 21:53
actually, I'd been toying with a bayes based system to classify the archives about a year ago... mas has put extensive work into training it, and in theory, a bayes discriminator should work well after about 500k of each class of documents...
alas, it never really got to be good... there is a chance that there simply isn't a simple border between 'archive worthy' and 'trash' posts...
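for reference, the core of such a discriminator is tiny. something like this (a from-memory sketch of the idea, not the actual code mas trained, and the sample data is made up):
code:
import math
from collections import defaultdict

# Word counts per class, built from threads mods already sorted.
counts = {"archive": defaultdict(int), "trash": defaultdict(int)}
totals = {"archive": 0, "trash": 0}
vocab = set()

def train(text, label):
    for word in text.lower().split():
        counts[label][word] += 1
        totals[label] += 1
        vocab.add(word)

def classify(text):
    # Pick the class with the highest log-probability,
    # with Laplace smoothing so unseen words don't zero it out.
    best, best_score = None, None
    for label in counts:
        score = 0.0
        for word in text.lower().split():
            p = (counts[label][word] + 1.0) / (totals[label] + len(vocab))
            score += math.log(p)
        if best_score is None or score > best_score:
            best, best_score = label, score
    return best

train("great tutorial thanks for sharing", "archive")
train("spam spam nonsense", "trash")
print(classify("thanks for the tutorial"))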
From: Rochester, New York, USA Insane since: May 2000
posted 04-22-2006 01:28
The archive worthy vs trash has to be a very interesting sample space.
What factors did the filter rely upon? Was it only content?
I don't know how the system could discriminate based only on the post content, but I am sure some very interesting results could be obtained by using the author, the time between posts, the number of posts, and the number of posts over a period of time.
I bet that such a system would be able to output some interesting correlations.
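For instance (just sketching, the field names are invented), those factors could be pulled out of a thread like this:
code:
def features(posts):
    # posts: list of (author, unix_timestamp) pairs, oldest first.
    # Purely illustrative metadata features - no content at all.
    timestamps = [t for _, t in posts]
    gaps = [b - a for a, b in zip(timestamps, timestamps[1:])]
    return {
        "post_count": len(posts),
        "distinct_authors": len(set(a for a, _ in posts)),
        "avg_gap_seconds": sum(gaps) / float(len(gaps)) if gaps else 0.0,
        "span_seconds": timestamps[-1] - timestamps[0] if posts else 0,
    }

print(features([("alice", 0), ("bob", 60), ("alice", 600)]))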
I haven't looked into anything AI/Expert System related with PHP. Are there any libraries that assist with such a task, or does most of it have to be programmed from scratch?
Hey, who gives a f* about pre-written libs. I mean, I am making an open source freeware sudoku solver based on this AI technology,
and businesses like Nestle or Philip Morris use the same "expert systems" structure to enforce their IT structures.
Believe me, I have full admin access on the worldwide Philip Morris headquarters network, so I have an "idea" of how huge businesses like that are run IT-wise.
Bayes filtering is nice, it's statistics, but expert systems rock for large workflows in that...
Write down ten rules.
Write down a few facts of the knowledge base.
Get them right.
The job is permanently done with an expert touch to it.
And if there is a doubt, it asks YOU and remembers for the rest of its.. "life".
What it takes is a strong, solid plan; that is what really matters, but once you've got it, ten lines of php can run it.
Making the plan is a huge task, but it will boil down, in the end, to a set of very simple and logical rules.
My, I am in an "abnormal" state so it's hard to gather my thoughts, and even harder to read, but what I meant was something like:
"we don't care about the language to implement it. We don't care about blurry lines -at first. It can solve complex problems from sudoku to high-profit business risks,
and it does it with very simple sets of rules"
In other words, once you got the theoretical skeleton, you can write it down in... a couple of hours max. The "mixed chaining algorithm" is known and available everywhere on the internet,
and it isn't much more than ten lines. If it is fed the right facts and rules, it can run businesses like PMI. .....puuuuuuffffffff...... managed to phrase that.
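To give an idea of how small the engine really is, here is the forward-chaining half of it sketched in Python (illustrative only: facts as a set of statements, rules as premises plus a conclusion; a "mixed" engine adds backward chaining on top of this):
code:
def forward_chain(facts, rules):
    # facts: a set of known statements.
    # rules: list of (premises, conclusion) pairs.
    # Keep firing rules until no new fact can be derived.
    changed = True
    while changed:
        changed = False
        for premises, conclusion in rules:
            if premises <= facts and conclusion not in facts:
                facts.add(conclusion)
                changed = True
    return facts

rules = [
    ({"post count is high", "content is suitable"}, "consider for archival"),
    ({"consider for archival", "thread is unique"}, "archive"),
]
print(forward_chain({"post count is high", "content is suitable",
                     "thread is unique"}, rules))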
Going to bed now, 4AM, head aches, fingers are sloppy.
quote:
there is a chance that there simply isn't a simple border between 'archive worthy' and 'trash' posts...
Ok.. Me head aches. Back to reality.
Well, I've pondered that quote this morning, have been wondering.
Such a system would be so cool.
And I know there is no line blurry enough that an expert system couldn't learn it, the main concern is: how many facts and rules are required to get it started?
That's something I've badly evaluated so far, and honestly, it's difficult, but this will tell if it's feasible or not.
And my 6th sense claims it is feasible.
So, with your consent, I'll just submit the idea to my AI teacher, and we'll see what comes out of that. He runs huge expert systems for companies, so he'll be able to tell me more.
What's very clear is that implementing such a system would happen in four major steps:
- analysis, defining the early classification rules
- implementation 1: with a given kb and inference engine, setting up the ES in "learning" mode. It would then ask mods a lot for advice (through an email/php system for instance; see the sketch after this list).
At this stage, it would systematically pop up emails to mods, asking "where should I put the following thread?"
Based on the answer, it would set a new rule, and discard an ambiguity.
- implementation 2: after a couple of weeks, we could have it propose occasional lists of threads it is about to archive, for review.
At this stage, mods would receive email transcripts of what it is about to do, and they could approve or correct, supplying a reason for the correction.
Analysing required corrections is critical, discussing them together..
- implementation 3: a couple of months later, the system knows enough to run on its own and report its activity to mods, who should not have to correct it, or only very rarely.
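Just to illustrate the stage 1 loop (everything below is invented - no real mail code, it's the shape of the learning loop that matters):
code:
def classify_or_ask(thread, rules, ask_mod):
    # Try the known rules first; on ambiguity, ask a human
    # moderator and remember the answer as a new rule.
    for condition, destination in rules:
        if condition(thread):
            return destination
    destination = ask_mod("Where should I put thread %r?" % thread["title"])
    # Crude memorisation for the sketch: match this exact title next time.
    rules.append((lambda t, title=thread["title"]: t["title"] == title,
                  destination))
    return destination

# A lambda stands in here for the email round-trip to a mod.
rules = []
answer = classify_or_ask({"title": "Photoshop curves tip"}, rules,
                         lambda question: "Photoshop archive")
print(answer)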
And I can try to get the first analysis started. My little finger says "if your teacher was right, a mod's task is not *that* difficult, and it's very, very specific", hence my suspicions.
Expert systems work best whenever the task is specific.
So let's "extract" the knowledge of the experts together, please.
To me, a thread should be classified based on:
- date
- quality of contents
- suitability of contents for broad audiences? Censorship or not? That's a whole debate per se
- contributor names
- maybe post count
- uniqueness: does it really bring something new?
And I can already lay out some rules based on that:
- if thread is older than >... then consider for archival.
- if thread quality is good then same as above
- if thread contents is suitable then same as above
- if post count is high...
- if contributors make wonderful things...
- if thread is unique and all the above holds true, then archive
Btw, somebody can somehow "vouch" that I do things very seriously lately, and that's Webshaman. The only thing is, I don't want to disclose the details of the project I am making for him,
and it's a bit different, but something "quite big and useful to the masses" too.
For the above set of rules, some things are clearly easy to determine. Some are more ambiguous (a code sketch follows this list).
- If thread contents are suitable > Means it doesn't contain curse words from a given list. Ambiguous expressions ("you suck" can be said for fun) should not belong to that list.
- If contributors make wonderful things... it's easy but a bit odd ethically. Bots don't care about ethics, and the engine will love knowing who has an expertise in this or that field, maybe with a rating, even. Based on this rating, we could take this aspect into account.
- If post count is high > Rarely means top quality, but means loads of interest, occasionally loads of fun. Occasionally bullshit. Can be considered easily nonetheless, because it would just influence the archivability of the thread.
Etc.
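Here is the same thing as code (the word lists and the thresholds are invented placeholders, not the real lists to use):
code:
CURSE_WORDS = set(["damn", "hell"])   # placeholder list, to be agreed on
AMBIGUOUS = set(["you suck"])         # said for fun - don't count it

def is_suitable(text):
    # Unsuitable only if a real curse word appears;
    # ambiguous expressions are stripped out first.
    cleaned = text.lower()
    for phrase in AMBIGUOUS:
        cleaned = cleaned.replace(phrase, "")
    return not any(word in CURSE_WORDS for word in cleaned.split())

def should_archive(thread):
    # Direct translation of the draft rules above: any positive
    # signal makes the thread a candidate, then suitability and
    # uniqueness confirm. The 90-day age is an invented threshold.
    candidate = (thread["age_days"] > 90
                 or thread["quality"] == "good"
                 or thread["post_count"] > 30
                 or thread["has_rated_contributors"])
    return candidate and is_suitable(thread["text"]) and thread["unique"]

print(should_archive({"age_days": 120, "quality": "good", "post_count": 12,
                      "has_rated_contributors": False,
                      "text": "great thread, you suck ;)", "unique": True}))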
And one final thought for today, before I get back to the Java assignment of the moment:
Archiving Photoshop forum threads to the Photoshop archive, etc., would be straightforward, but it doesn't hold true in all cases.
I guess we could leave that out for human mods though... since most threads get "moved" based on common sense if they're misplaced.
Let me know what you think, and what I might have left out, it's just the first draft.
From: Rochester, New York, USA Insane since: May 2000
posted 04-23-2006 20:23
As opposed to emails for clarification, it might be a better idea if we kept a list of questions and their answers on the site. Any moderator could go into this area and choose to answer new questions, and would have the ability to change the answer for existing questions.
Well, the production knowledge base has to be handled with caution, so your proposal is great, but it should not directly impact the knowledge base.
Final edits to the kb should be approved by several mods in one way or another.
I'll see the AI teacher on Wednesday, and the Sudoku thing is due for the following Wednesday, so I will basically be able to ask him
and, with the sudoku thing, to show a code sample of such a system and kb.
Well, no AI course for me tonight, skipped that, overworked.
The Sudoku solver won't be a good example either, as it's aimed towards genetic algorithms.
Such a moderation system is really feasible; in fact, one of the new AI projects for a team in our classes is almost exactly this:
an expert moderation system, though less specifically aimed at forums.
It combines Genetic Algorithm and Expert System techniques, and proves that a mod-bot is really feasible and quite easy,
it seems.
My own implementation of a GA is already solving some Sudoku stuff, not all (yet), so I am asking for access to:
- either a static "snapshot" of some sink folder for me to experiment on some ES/Genetic sorting.
- a prior, beta version of the "grail".
Tyberius Prime
Maniac (V) Mad Scientist with Finglongers
From: Germany Insane since: Sep 2001
posted 05-05-2006 16:39
well... I have no clue why you wouldn't want a current version of the grail...
Anyhow, you can find the sink at http://www.ozoneasylum.com/the%20sink, and grep any number of the 50 * 267 sunken threads (that's over 10,000. Woah) from there.
I suggest you download those so you can actually rerun your algorithm quickly. Just cut the inner table, regexps for ozoneasylum.com/\d+, retrieve each one and save to a file...
Shouldn't matter to your algorithm that they all have a common html backbone.
Would prefer doing it locally, to avoid issues due to response time and to avoid hammering the live Asylum with requests.
The original intent was this *while staying as close as possible to the original Asylum*.
Does it matter if I use a site downloader or something of that kind? Do you mind?
Okay, the ES / GA approach makes a lot of sense now that I am starting a real plan.
Basically, the ES part makes it able to learn, but the GA part is the real [sorting machine / bot].
I've selected criteria for the eligibility of posts.
The aim of the game is teaching the bot to learn what is a good Photoshop Thread, what is a poor DHTML thread, etc.
However, since I never touch some of the forums, I am not aware of the two most important aspects for those:
- who is a "worthy poster" there?
- what is a "word that makes it sound worthy" there?
These are the main weapons my GA will rely on.
So, here is the list (a sketch of how the GA could weight these signals follows right after it):
quote:
In all cases:
- age of thread / irrelevant <because many old threads are still in the sink>
- occurrences of posts by the Doc, DL-44, Emperor, Suho1004, Slime / good
- occurrences of swear words / bad
- thread was locked / bad
- occurrences of the words * thank you * thanks for sharing * useful * excellent * / good
- high post count / good
Ozone:
- occurrences of the words * fun * lmao * funny * nice * cool * kudos * congratulations * and number of occurrences per post / good
DHTML:
- occurrences of the words * algorithm * new technique * fun * improved * bugfix * fix * workaround * and number of occurrences / good
- number of posts by p01, shingebis, ironwallaby, _Mauro, Slime, liorean, Bugimus, Eddy Traversa / good
Server-side:
- occurrences of the words * database * security * improved * / good
- number of posts by TP, _Mauro, <who else?> / good
CSS & stuff:
- occurrences of the words * web standards * valid * <what else?> / good
- number of posts by HZR, kuckus / good
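Concretely, the GA part boils down to evolving one weight per signal above, scored against mod-approved examples. A stripped-down, mutation-only sketch (signal names shortened, all numbers arbitrary):
code:
import random

# One genome = one weight per signal from the list above.
SIGNALS = ["good_poster_posts", "swear_words", "locked",
           "praise_words", "post_count"]

def score(thread, weights):
    # thread: dict of signal counts. Positive score = archive-worthy.
    return sum(weights[s] * thread[s] for s in SIGNALS)

def fitness(weights, labelled):
    # How often the weighted score agrees with a mod's verdict.
    return sum(1 for thread, worthy in labelled
               if (score(thread, weights) > 0) == worthy)

def evolve(labelled, generations=200, size=30):
    population = [dict((s, random.uniform(-1, 1)) for s in SIGNALS)
                  for _ in range(size)]
    for _ in range(generations):
        population.sort(key=lambda w: fitness(w, labelled), reverse=True)
        parents = population[:size // 2]
        # Refill the population with mutated copies of the best genomes.
        population = parents + [
            dict((s, p[s] + random.gauss(0, 0.1)) for s in SIGNALS)
            for p in parents]
    return population[0]

labelled = [({"good_poster_posts": 3, "swear_words": 0, "locked": 0,
              "praise_words": 2, "post_count": 12}, True),
            ({"good_poster_posts": 0, "swear_words": 4, "locked": 1,
              "praise_words": 0, "post_count": 2}, False)]
print(evolve(labelled))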
And you'll have to help me fill this in so I can start turning it into something live.
Plus, since I have finally got my own ftp back, I'll be able to publish updates to this shortly for everybody to
follow.
I am willing to do this in my spare time, but not without a minimal amount of help (I couldn't anyway).
Thanks in advance.
bumping this for f*'s sake. Might as well try to spread the good word before I give up.
Additionally, web site downloaders have serious problems getting pages from the sink,
and one page is just not enough to do anything sensible.
So.. It is feasible, I can do it, I even think it's easy, but it really is up to the Asylum to help me do it
(by giving me pages from the sink I can work on and/or reviewing the above post/plan).
Cheers.
Tyberius Prime
Maniac (V) Mad Scientist with Finglongers
From: Germany Insane since: Sep 2001
posted 05-10-2006 18:31
go right ahead with the site fetcher, just limit it to a couple thousand pages.
Otherwise, you won't have the 'where is this thread from' information - and if you really need that, I suggest you start with a single forum to see how it works out.
TP, I am sorry I didn't make it clear, but site fetchers don't fetch the sink. None of those I have tried, at least.
They totally choke on the rewritten urls, no matter the settings I use. I really would *love* to work on this,
but I can't manage to download the required material.
Tyberius Prime
Maniac (V) Mad Scientist with Finglongers
From: Germany Insane since: Sep 2001
posted 05-11-2006 07:19
code:
import urllib2
import re

# Fetch the sink's index page.
url = "http://www.ozoneasylum.com/the%20sink"
page = urllib2.urlopen(url).read()

# Keep only the thread table, between these two landmarks.
page = page[page.find('<th width="115" colspan="2">Latest Post</th>'):page.find('<a name="belowFirstTable">')]

# Might have to twiddle with this - the slash and the escaped dots matter:
# thread urls look like http://www.ozoneasylum.com/12345
urls = re.findall(r"http://www\.ozoneasylum\.com/[0-9]+", page)

for x in urls:
    # Name each saved file after its thread id.
    filename = re.findall("[0-9]+", x)[0] + '.html'
    fileHandle = open("b:/" + filename, 'wb')
    fileHandle.write(urllib2.urlopen(x).read())
    fileHandle.close()
I don't understand your willingness to make things harder for me, it's kind of an uncool way to "welcome" my proposal,
but oh well, it's an answer of some sort (taking the time to install python and urllib2 is really something I'd have liked to avoid,
but as long as the script works...).
Have you ever developed an application? Have you ever developed an AI?
..I'd have guessed you hadn't.
I don't have the time nor the energy to explain to you in detail *why* what you suggest doesn't work;
it has been implemented a long time ago.
Explaining to you how and why and what makes an AI work, and how to balance it correctly, would not help you,
would not help me, would not help the Asylum, and would be hard to understand even for good developers, because it *is* a complex topic.
Questions like yours are, sadly, irrelevant when it comes to that topic: don't even get me started on trying to tell you why,
but the post count matters *a lot* to an AI, which doesn't work like a human brain, and will never work like a human brain. At least not in the next decade.
Might as well have said "why not turn the sky to red, it's more pretty!!".
It makes as much sense. Hold on, I'll turn the sky to red for your pl... wait! NO, IT DOES NOT WORK THAT WAY.
If you want to implement your own AI for such a task, enjoy yourself, but don't, don't, don't post comments on *my way* other than
constructive comments that will feed me with the information I need.
I don't have posts to waste, I don't have a second to waste, and all your advice and anyone else's advice on how I should plan
my app in this case will go to the trashcan where it belongs.
If you want to provide useful input like the one I've asked for, nothing more, nothing less, post away.
Thank you for trying, but thank you for not asking anything of that kind or making any comment of that kind in the future:
I'll document the realized product, I'll take the time later to show you why the sky is blue, but for now,
comments like yours are wasted posting space and wasted nerves for me.
And don't, DON'T start a debate on this, don't even try to argue: if I get more arguments than useful input the first day
I get started working on this, I'll give up.
In other words, trust me or shut up, but don't interfere, many thank yous.
Have I developed an application? A few small ones.
Have you ever developed an AI? Yes, for college.
I have nothing against your plans, I just don't think using a person's post count is a good measure of anything, and posting my opinion is not interference. It certainly wasn't a question. I certainly didn't ask you to explain what an AI is, and condescendingly crap on about how it's not a human brain. Did you see a question mark?
You clearly do have posts to waste, you made a completely useless and inane one above this.
Seriously, where the hell is this 'turn the sky red' thing from? You've lost me, I don't get the analogy. My quick-saves idea would be a number your AI could use to override itself, and wasn't necessarily a comment directed only at you.
Saving bandwidth? Maybe, but if your bot doesn't work, it'll waste a load of it and screw up the forum.
I don't want to debate either, it wasn't the intention (although this is a forum). "trust me or shut up, but don't interfere, many thank yous" - I had some respect for you even after you revoked all your posts and made a mess before, but that comment is just too much ego.
From: Rochester, New York, USA Insane since: May 2000
posted 05-11-2006 14:39
TP, nice script there. I would have to say that Python is my favorite language to code with. I have found that on average my code is 25% of the size, and takes 25% of the time, in Python as opposed to Java. When trying to do things like crunch giant numbers (100+ digits) it is really slow, but for everyday tasks it is perfect.
Hugh, in case you hadn't noticed, this thread is devoted to the technical part of "how to develop this",
contains reference documents for me, and things like
quote:I had some respect for you even after you revoked all your posts and made a mess before, but that comment is just too much ego.
Are for email, brundle [twenty one] at hotmail * com.
As far as your respect is concerned, I do things that I enjoy doing, I do them as a perfectionist, and I do them the way I like. "Do this more like..." -
nobody cared to do it before, and only a few people can do it well.
Respect? I am not doing this to earn respect, so make my day, take your respect and shove it.
As far as this being a forum, if you don't have anything that makes sense on a technical level to share in this particular post,
then you're bloating it up, preventing me from reading easily through it, being a pain in the ass and polluting.
Now stop. You got an answer; if you want to target me, make a new thread, do whatever you would like to do, but just stop posting junk
where I would like relevant tech info to be gathered and nothing else, because I am using this thread, and you are filling it up with irrelevant stuff, that's- ALL.
Overall, your input, so far, has been as relevant and useful as a request to turn the sky red... just make my day, insult me, stick pins
in a voodoo doll, make a new thread to say how much I suck but get out of my way. Thanks.
Tyberius Prime
Maniac (V) Mad Scientist with Finglongers
From: Germany Insane since: Sep 2001
posted 05-11-2006 17:39
WarMage: thanks. I do love python as well... I mean, I've my own php editor written mostly in python, just for kicks
That was a 'top of the head' script though... never been run, just sketching what to do.
Anyhow, bickering is the asylum's mode of discussion. Live with it.
If you can't filter out what *you* need from a thread, you're pretty much lost in this world of information overflow.
As for making things harder, Mauro:
I personally don't believe in your approach.
*But* I will not prevent you from trying. I just won't invest a lot of my time. If it works out, cool, I'll be the first to aid you to get it running constantly.
Urllib2 is included in python's batteries, btw - and "apt-get install python" is less work than copy/pasting that script I typed for you from the back of my head.
I don't mind your challenging me, I think it's fair enough, but while you don't believe...
I know.
quote:
I'll be the first to aid you to get it running constantly.
Ok, you'll remember that, trust me, I'll have you sweat *a lot* when I am done with your challenge.
As for Hugh, I had asked for such comments to be avoided, because it's painful to even -start- explaining why I feel I got it right.
I feel I got it right because the Sudoku solver works thanks to uncommon ideas, because GA is uncommon, and because I have some experience at getting things that seem impossible to work.
If everybody starts throwing in 2 cents on how I should do it, this will end up in ten years of voting on features, two centuries of laying out the plan,
and a few millennia of implementation and bug fixing.
That's why I am taking full ownership of this. Thank yous to Hugh for having brought the "flames" to email where they belong and can be sorted out without compromising my work.
Now, I am off of this thread, I have requested what I need, let the answers flow, I am getting your files and working.
Ah, one thing... while I believe you're a decent coder, lately, I've been taught a few things about "planning, methodology, and developing in groups".
So, if you like showing off at throwing a python script off the top of your head, cool. The time you spent could have been used for other purposes than reinventing the wheel.
The same goes for working on html pages: I now have to find my way through html pages instead of raw text data.
So I am basically condemned to implementing it *wrong* right from the start, AND LATER ON turning it, maybe, into something that can handle text only.
Writing html filtering features that will end up being.. useless.
I can live with that challenge, as silly as it is, but be prepared to live up to what I'll deliver