Jorge Agudo Praena joragupra

@joragupra
joragupra / add_stop_words.py
Created March 25, 2018 20:16
Use stop words for better classification
from sklearn.feature_extraction.text import TfidfVectorizer

# Spanish function words that carry little topical signal
prepositions = ['a', 'ante', 'bajo', 'cabe', 'con', 'contra', 'de', 'desde', 'en', 'entre', 'hacia', 'hasta', 'para', 'por', 'según', 'sin', 'so', 'sobre', 'tras']
prep_alike = ['durante', 'mediante', 'excepto', 'salvo', 'incluso', 'más', 'menos']
adverbs = ['no', 'si', 'sí']
articles = ['el', 'la', 'los', 'las', 'un', 'una', 'unos', 'unas', 'este', 'esta', 'estos', 'estas', 'aquel', 'aquella', 'aquellos', 'aquellas']
aux_verbs = ['he', 'has', 'ha', 'hemos', 'habéis', 'han', 'había', 'habías', 'habíamos', 'habíais', 'habían']
# Exclude all of the words above when building the tf-idf vocabulary
tfid = TfidfVectorizer(stop_words=prepositions + prep_alike + adverbs + articles + aux_verbs)
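To see what the `stop_words` argument changes, the following minimal sketch fits two vectorizers on a tiny corpus and compares their vocabularies. The corpus and the shortened stop-word list are hypothetical and chosen only for illustration; they are not part of the original gist.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical two-document corpus (an assumption, not from the gist)
corpus = [
    "el gato duerme en la casa",
    "la casa de los gatos",
]

# Illustrative subset of the Spanish stop words listed in the gist
stop_words = ["el", "la", "los", "en", "de"]

plain = TfidfVectorizer()
filtered = TfidfVectorizer(stop_words=stop_words)
plain.fit(corpus)
filtered.fit(corpus)

# vocabulary_ maps each kept term to its column index
plain_vocab = set(plain.vocabulary_)
filtered_vocab = set(filtered.vocabulary_)
removed = plain_vocab - filtered_vocab

print(sorted(removed))         # terms dropped by the stop-word list
print(sorted(filtered_vocab))  # terms that remain as features
```

Note that the default `token_pattern` of `TfidfVectorizer` only keeps tokens of two or more characters, so a one-letter stop word such as `'a'` would be ignored even without appearing in the list.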
@joragupra
joragupra / check_text_classificator.py
Created March 25, 2018 20:15
Check accuracy of new text classifier
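# Load the held-out documents and vectorize them with the already-fitted tfid
test = read_all_documents('examples2')
X_test = tfid.transform(test['docs'])  # transform only: reuse the training vocabulary
y_test = test['labels']
pred = clf.predict(X_test)
# For classifiers, score() returns the mean accuracy on the given data
print('accuracy score %0.3f' % clf.score(X_test, y_test))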
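As a rough illustration of what the accuracy check computes, the sketch below compares `clf.score` with `accuracy_score` applied to the predictions and with a manual count of correct labels. The toy documents, the labels, and `n_neighbors=1` are assumptions made so the example is self-contained; they do not come from the gist.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical toy data (not from the gist)
train_docs = ["gol partido equipo", "equipo liga gol",
              "banco dinero credito", "credito banco interes"]
train_labels = ["deportes", "deportes", "economia", "economia"]
test_docs = ["partido liga", "dinero interes"]
test_labels = ["deportes", "economia"]

vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train_docs)
X_test = vectorizer.transform(test_docs)  # reuse the fitted vocabulary

clf = KNeighborsClassifier(n_neighbors=1).fit(X_train, train_labels)
pred = clf.predict(X_test)

# Three ways to obtain the same number: fraction of correct predictions
score_from_clf = clf.score(X_test, test_labels)
score_from_pred = accuracy_score(test_labels, pred)
manual = sum(p == t for p, t in zip(pred, test_labels)) / len(test_labels)

print('accuracy score %0.3f' % score_from_clf)
```

Keeping `pred` around is useful when more than a single accuracy figure is wanted, for instance to inspect which documents were misclassified.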
@joragupra
joragupra / kmeans_tfidf.py
Created March 25, 2018 20:14
Train a k-nearest neighbors classifier to classify texts
from sklearn.neighbors import KNeighborsClassifier
clf = KNeighborsClassifier(n_neighbors=3)
clf.fit(X_train, y_train)
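One possible way to keep the vectorizer and the k-nearest neighbors classifier together is scikit-learn's `make_pipeline`, so that new texts are always transformed with the same fitted vocabulary before prediction. This is a sketch of an alternative arrangement, not the gist author's code, and the documents and labels below are hypothetical.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline

# Hypothetical training data: three short documents per class
docs = [
    "gol partido equipo", "equipo liga gol", "partido gol liga",
    "banco dinero credito", "credito banco interes", "interes dinero banco",
]
labels = ["deportes", "deportes", "deportes",
          "economia", "economia", "economia"]

# The pipeline vectorizes raw text and then classifies it by majority vote
# among the 3 nearest training documents
model = make_pipeline(TfidfVectorizer(), KNeighborsClassifier(n_neighbors=3))
model.fit(docs, labels)

predictions = list(model.predict(["gol equipo", "banco credito"]))
print(predictions)
```

With this arrangement there is no separate `tfid.transform` step to forget when classifying unseen text.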
@joragupra
joragupra / tf_idf_creation.py
Created March 25, 2018 20:12
Create tf-idf matrix for text classification
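from sklearn.feature_extraction.text import TfidfVectorizer

# tfid is not created in this snippet; a default vectorizer is assumed here
tfid = TfidfVectorizer()
X_train = tfid.fit_transform(documents)
y_train = labels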
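To make the shape of the resulting tf-idf matrix concrete, here is a small sketch on three hypothetical documents (an assumption for illustration only). It shows that `fit_transform` returns one row per document and one column per vocabulary term, and that with the default settings each row is L2-normalized.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical documents (not from the gist)
documents = ["gol partido equipo", "banco dinero credito", "equipo banco"]

tfid = TfidfVectorizer()
X_train = tfid.fit_transform(documents)  # sparse matrix: documents x terms

n_docs, n_terms = X_train.shape
# Default norm='l2' scales every document vector to unit length
row_norms = np.linalg.norm(X_train.toarray(), axis=1)

print(n_docs, n_terms)
print(sorted(tfid.vocabulary_))
```

Calling `toarray()` is fine for a toy corpus; for a real collection the matrix is usually kept sparse.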
@joragupra
joragupra / execute_read_all_documents.py
Created March 25, 2018 20:11
Create documents and labels for text classification
data = read_all_documents('examples')
documents = data['docs']
labels = data['labels']
@joragupra
joragupra / read_all_documents.py
Created March 25, 2018 20:08
Read documents for text classification
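import os

def read_all_documents(root):
    # Walk root recursively; each file is a document labeled by its directory
    labels = []
    docs = []
    for r, dirs, files in os.walk(root):
        for file in files:
            with open(os.path.join(r, file), "r") as f:
                docs.append(f.read())
            # Label is the directory path with the root prefix removed
            labels.append(r.replace(root, ''))
    return dict([('docs', docs), ('labels', labels)])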
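The function appears to assume a layout with one subdirectory per label under the root. The sketch below builds such a layout in a temporary directory and reads it back. The directory names and file contents are hypothetical, and the function here is a slight variation on the gist: it uses `os.path.relpath` so that labels carry no leading path separator.

```python
import os
import tempfile

def read_all_documents(root):
    # Variation on the gist's function (assumption): labels are computed with
    # os.path.relpath instead of str.replace
    labels, docs = [], []
    for r, dirs, files in os.walk(root):
        for file in files:
            with open(os.path.join(r, file), "r", encoding="utf-8") as f:
                docs.append(f.read())
            labels.append(os.path.relpath(r, root))
    return {"docs": docs, "labels": labels}

# Build a hypothetical corpus: root/<label>/doc1.txt
with tempfile.TemporaryDirectory() as root:
    for label, text in [("deportes", "gol partido"), ("economia", "banco dinero")]:
        os.makedirs(os.path.join(root, label))
        with open(os.path.join(root, label, "doc1.txt"), "w", encoding="utf-8") as f:
            f.write(text)
    data = read_all_documents(root)

# os.walk does not guarantee an order, so sort before inspecting
pairs = sorted(zip(data["labels"], data["docs"]))
print(pairs)
```

Files placed directly in the root would get the label `.` with this variation, so it is simplest to keep every document inside a label subdirectory.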
@joragupra
joragupra / master.xml
Created July 5, 2016 06:37
Delete address columns from customer table
<changeSet id="customer-005" author="joragupra">
    <comment>Delete columns for address information from customer table.</comment>
    <dropColumn tableName="customer" columnName="street_name"/>
    <dropColumn tableName="customer" columnName="street_number"/>
    <dropColumn tableName="customer" columnName="postal_code"/>
    <dropColumn tableName="customer" columnName="city"/>
    <dropColumn tableName="customer" columnName="address_since"/>
@joragupra
joragupra / Customer.java
Created July 5, 2016 06:36
Remove address information fields from Customer class
public class Customer {
    @Id
    @GeneratedValue
    private Long id;
    @Column(name = "first_name")
    private String firstName;
    @Column(name = "last_name")
    private String lastName;
    @OneToMany(cascade = CascadeType.ALL)
@joragupra
joragupra / Customer.java
Created July 5, 2016 06:32
Use address history as primary source when retrieving address information
public class Customer {
    ...
    public Address currentAddress() {
        // Most recent address first; get() throws if the history is empty
        return addressHistory().stream()
                .sorted(comparing(Address::addressSince).reversed())
                .findFirst()
                .get();
    }
    ...
@joragupra
joragupra / address_migration_2.sql
Last active July 5, 2016 06:52
Update address in address table with data from customer table
WITH caddresses_not_updated AS (
    SELECT c.* FROM customer c
    LEFT JOIN address a ON a.customer_id = c.id
    WHERE (c.street_name IS NOT NULL OR c.street_number IS NOT NULL OR c.postal_code IS NOT NULL OR c.city IS NOT NULL)
      AND a.id IS NOT NULL
      AND NOT EXISTS (SELECT * FROM address a2 WHERE a2.customer_id = c.id AND a2.address_since > a.address_since)
      AND c.address_since > a.address_since
)
INSERT INTO address (
id,
street_name,
street_number,
postal_code,
city,